Sequential part of the program takes more time as process count increases (C)

I'm writing a parallel C program using MPICH. The program naturally has a sequential part and a parallel part. The parallel part seems to be working fine, however I'm having trouble with the sequential part.
In the sequential part, the program reads some values from a file and then proceeds to distribute them among the other processes. It is written as follows:
if (rank == 0)
{
    gettimeofday(&sequentialStartTime, NULL);
    // Read document ids and their weights into the corresponding variables
    documentCount = readDocuments(&weights, &documentIds, documentsFileName, dictionarySize);
    readQuery(&query, queryFileName, dictionarySize);
    gettimeofday(&endTime, NULL);
    timersub(&endTime, &sequentialStartTime, &resultTime);
    // tv_sec is in seconds and tv_usec in microseconds, so convert both to milliseconds
    printf("Sequential part: %.2f ms\n", resultTime.tv_sec * 1000.0 + resultTime.tv_usec / 1000.0);
    // distribute the data to the other processes
} else {
    // wait for the data and then start working
}
Here, readQuery and readDocuments read values from files, and the elapsed time is printed after they complete. This piece of code works just fine on its own. The problem arises when I run it with different numbers of processes.
I run the program with the following command
mpirun -np p ./main
where p is the process count. I expect the sequential part to run in roughly the same amount of time no matter how many processes I use. For p values from 1 to 4 this holds; however, when I use 5 to 8 processes, the sequential part takes more time.
The processor I'm using is an Intel® Core™ i7-4790 CPU @ 3.60GHz × 8, and my host operating system is Windows 8.1 64-bit. I'm running this program on Ubuntu 14.04 inside a virtual machine that has full access to my processor and 8 GB of RAM.
The only explanation that came to my mind is that when the process count is higher than 4, the main process may have to share its physical core with another process, since this CPU has 4 physical cores but presents 8 logical cores through hyper-threading. However, the execution time keeps increasing linearly as I raise p from 5 to 6 to 7 and so on, so that alone cannot be the cause.
Any help or idea on this would be highly appreciated. Thanks in advance.
Edit: I realized that increasing p increases the run time regardless of its value; I'm getting a linear increase in time as p grows.
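For what it's worth, one thing to check is where the launcher actually places rank 0. Assuming an MPICH build that uses the Hydra process manager (the -bind-to option below is an addition, not part of the original command), each rank can be pinned to its own core:

mpiexec -np 4 -bind-to core ./main

If the sequential time stops growing once rank 0 has a core to itself, the slowdown is more likely a placement/oversubscription effect than a property of the file reads themselves.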

Related

Difference between MPI_Wtime and actual wall time

I implemented a MIMD genetic algorithm using C and OpenMPI where each process takes care of an independent subpopulation (island model). So, for a population of size 200, a 1-process run operates on the whole population while 2 processes each evolve a population of size 100.
So, measuring the execution time with MPI_Wtime, I get the expected execution time when running on a 2-core machine with Ubuntu. However, it disagrees with both Ubuntu's time command and my own perception: it's noticeable that running with 2 processes takes longer for some reason.
$time mpirun -n 1 genalg
execution time: 0.570039 s (MPI_Wtime)
real 0m0.618s
user 0m0.584s
sys 0m0.024s
$time mpirun -n 2 genalg
execution time: 0.309784 s (MPI_Wtime)
real 0m1.352s
user 0m0.604s
sys 0m0.064s
For a larger population (4000), I get the following:
$time mpirun -n 1 genalg
execution time: 11.645675 s (MPI_Wtime)
real 0m11.751s
user 0m11.292s
sys 0m0.392s
$time mpirun -n 2 genalg
execution time: 5.872798 s (MPI_Wtime)
real 0m8.047s
user 0m11.472s
sys 0m0.380s
I get similar results whether or not there is communication between the processes, and I also tried MPI_Barrier. I got the same results with gettimeofday, and turning gcc optimization on or off doesn't make much difference.
What is possibly going on? It should run faster with 2 processes, like MPI_Wtime suggests, but in reality it's running slower, matching the real time.
Update: I ran it on another PC and didn't have this issue.
The code:
void runGA(int argc, char* argv[])
{
    /* (initializations) */
    if (MYRANK == 0)
        t1 = MPI_Wtime();

    genalg();
    Individual* ind = best_found();

    MPI_Barrier(MPI_COMM_WORLD);
    if (MYRANK != 0)
        return;

    t2 = MPI_Wtime();
    exptime = t2 - t1;
    printf("execution time: %f s\n", exptime);
}
My guess (and the asker's) is that time's user figure is the sum of the CPU time used by all cores. It is more of a cost: with 2 processes on 2 cores, the cost is time1 + time2, because the second core could otherwise have been used by another process, so you "spend" that time on the second core as well. MPI_Wtime() reports the actual wall-clock time as a human would perceive it.
That may explain why the real time is lower than the user time in the second large-population run: real (8.047 s) is closer to the MPI_Wtime value (5.87 s) than to the sum of user and sys (11.852 s), and that sum roughly matches two cores each busy for about 5.9 s. In the first case, the initialization time is too large and probably distorts the result.
The issue was solved after upgrading Ubuntu Mate 15.10 to 16.04, which came with OpenMPI version 1.10.2 (the previous one was 1.6.5).

Regarding CPU utilization

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int i = 10;
while (1) {
    i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I add a printf() inside the loop, the CPU hits 88%. I understand that in this case the processor also has to write to the standard output stream, hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
Regards,
Ven.
You have a multi-core processor and you are in a single-threaded scenario, so you will run only one core at full throttle. Why would you expect overall processor usage to reach 100% in that situation?
Run two copies of your program at the same time. They will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%.
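Alternatively, the same effect can be produced inside a single process by running the busy loop on two threads. A minimal sketch with POSIX threads (an illustration, not the original code); it should drive both cores of a Core 2 Duo to 100%:

#include <pthread.h>

/* Each thread spins forever, keeping one core busy. */
static void *spin(void *arg)
{
    volatile int i = 10;      /* volatile so the loop is not optimized away */
    while (1)
        i = i * 5;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, spin, NULL);   /* one thread per core */
    pthread_create(&t2, NULL, spin, NULL);
    pthread_join(t1, NULL);                  /* never returns */
    pthread_join(t2, NULL);
    return 0;
}

Compile with the -pthread flag.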
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
The printf sends some characters to the terminal/screen. Sending the data, displaying it, and updating the screen are handled by code outside your executable, and that work is likely executed on another thread. But displaying a few characters does not need 100% of such a thread. That is why you see, say, 100% on core 1 and 76% on core 2, which averages out to the overall 88% CPU usage you observe.

Negative Scaling of MPI_Allgather

I am facing a problem while programming a parallel molecular-dynamics algorithm in C in which every core computes its smallest collision time and then communicates its collision partners via MPI_Allgather to all other cores, so that everyone can see which collision is the earliest.
I have built in a time-measurement function to see how the different parts of my program scale. It shows that for 8 nodes (192 cores) the Allgather takes 2000 seconds over 100k timesteps, while it takes 5000 seconds for 20 nodes (480 cores).
I use the Cray compiler on a Cray system with the following flags:
add_definitions(-DNDEBUG)
set(CMAKE_C_FLAGS "-O3 -h c99,pl=./compiler_information,wp")
set(CMAKE_EXE_LINKER_FLAGS "-h pl=./compiler_information,wp")
and the part of the code looks like this:
MPI_Barrier(cartcomm);
START(scmcdm_Allgather); // time measure
MPI_Allgather(v_min_cpartner, 1, mpi_vector5, min_cpartners, 1, mpi_vector5, cartcomm);
STOP(scmcdm_Allgather); // time measure
where mpi_vector5 is a continuous datatype containing 5 doubles:
MPI_Type_contiguous(5, MPI_DOUBLE, &mpi_vector5);
Is this normal behavior? How do I optimize this?
UPDATE:
Thanks for your comments; I implemented two other ways of solving the problem:
All cores first send an integer flag saying whether they actually have a collision in the given timestep (only a few will), and then only the cores that have a collision communicate it to core 0, which broadcasts the minimum.
Here the first step, in which all cores communicate with core 0, is slow. Is there any way in MPI to skip this step and use a collective communication routine in which only some of the cores participate (namely the ones that actually have a collision)?
Instead of communicating the vector5, I used a double/int pair holding the collision time and the rank so that I could use the MPI_MINLOC operation. The core with the minimum collision time then broadcasts its vector5.
This solution is the fastest so far, but it still scales negatively (1600 s on 8 nodes, 3000 s on 20 nodes).
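A minimal sketch of this second approach, assuming a variable myrank holding the caller's rank and that element 0 of v_min_cpartner holds the collision time:

struct {
    double time;   /* local minimum collision time */
    int    rank;   /* rank that owns it */
} local, global;

local.time = v_min_cpartner[0];
local.rank = myrank;

/* Every rank learns the global minimum time and which rank owns it. */
MPI_Allreduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MINLOC, cartcomm);

/* Only the owner's full 5-double collision record is broadcast. */
MPI_Bcast(v_min_cpartner, 5, MPI_DOUBLE, global.rank, cartcomm);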
Any other idea?
Addressing the issue of the original question:
Keep your MPI simple. Avoid user-defined datatypes when you are sending something as simple as 5 doubles; if you are sending 5 doubles, say so: MPI_Allgather(v_min_cpartner, 5, MPI_DOUBLE, min_cpartners, 5, MPI_DOUBLE, cartcomm)
The reason is that, unfortunately, the code paths for user-defined MPI datatypes are not always as well optimized as those for the predefined types.
I assume that you are printing the time when the operation has finished on rank 0. That is one way, but it may give you an incomplete picture. A better way is to time the operation on each rank and then run an MPI_Allreduce to find the maximum of the times reported by all ranks.
Collective operations finish at different times on different ranks, and MPI implementations switch algorithms internally depending on the number of ranks, the payload size, the size of the machine allocation, etc. Timing a specific rank makes sense if that rank is on the critical path of an MPMD program; in an SPMD program you will end up waiting for the slowest rank at your next collective operation or synchronization point.
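A minimal sketch of that measurement pattern, reusing the names from the question plus an assumed myrank variable:

double t0, t_local, t_max;

MPI_Barrier(cartcomm);                    /* start all ranks together */
t0 = MPI_Wtime();
MPI_Allgather(v_min_cpartner, 5, MPI_DOUBLE, min_cpartners, 5, MPI_DOUBLE, cartcomm);
t_local = MPI_Wtime() - t0;               /* per-rank elapsed time */

/* The cost of a collective is set by the slowest rank, so take the maximum. */
MPI_Allreduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, cartcomm);
if (myrank == 0)
    printf("Allgather (slowest rank): %f s\n", t_max);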

QueryPerformance counter and Queryperformance frequency in windows

#include <windows.h>
#include <stdio.h>
#include <stdint.h>

// assuming we return times with microsecond resolution
#define STOPWATCH_TICKS_PER_US 1

uint64_t GetStopWatch()
{
    LARGE_INTEGER t, freq;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&freq);
    // convert ticks to microseconds
    return (uint64_t) (t.QuadPart / (double) freq.QuadPart * 1000000);
}

void task()
{
    printf("hi\n");
}

int main()
{
    uint64_t start = GetStopWatch();
    task();
    uint64_t stop = GetStopWatch();
    printf("Elapsed time (microseconds): %llu\n", (unsigned long long)(stop - start));
}
The code above uses QueryPerformanceCounter, which retrieves the current value of the high-resolution performance counter, and QueryPerformanceFrequency, which retrieves the frequency of that counter. When I call task() multiple times, the difference between the start and stop times varies, but I expected to get the same time difference for every call. Could anyone help me identify the mistake in the above code?
The thing is, Windows is a pre-emptive multi-tasking operating system. What the hell does that mean, you ask?
'Simple' - Windows allocates time-slices to each of the running processes in the system. This gives the illusion of dozens or hundreds of processes running in parallel. In reality, you are limited to 2, 4, 8 or perhaps 16 parallel processes on a typical desktop/laptop. An Intel i3 has 2 physical cores, each of which can give the impression of doing two things at once (in reality there are hardware tricks that switch execution between the two threads each core can handle at once). This is in addition to the software context switching that Windows/Linux/MacOSX do.
These time-slices are not guaranteed to be of the same duration each time. You may find the PC syncs its clock with the Windows Time service, the virus scanner decides to start working, or any one of a number of other things. All of these events may occur after your task() function has begun yet before it ends.
In the DOS days, you'd get very nearly the same result each and every time you timed a single iteration of task(). Though, thanks to TSR programs, you could still find an interrupt was fired and some machine-time stolen during execution.
It is for just these reasons that a more accurate determination of the time a task takes to execute may be obtained by running the task N times and dividing the elapsed time by N to get the time per iteration.
For some functions in the past, I have used values for N as large as 100 million.
EDIT: A short snippet.
LARGE_INTEGER tStart, tEnd;
LARGE_INTEGER tFreq;
double tSecsElapsed;
int i, n = 100;

QueryPerformanceFrequency(&tFreq);
QueryPerformanceCounter(&tStart);
for (i = 0; i < n; i++)
{
    // Do Something
}
QueryPerformanceCounter(&tEnd);

tSecsElapsed = (tEnd.QuadPart - tStart.QuadPart) / (double)tFreq.QuadPart;
double tMsElapsed = tSecsElapsed * 1000;
double tMsPerIteration = tMsElapsed / (double)n;
Code execution time on modern operating systems and processors is very unpredictable. There is no scenario in which you can be sure that the elapsed time actually measured the time taken by your code; your program may well have lost the processor to another process while it was executing. The caches used by the processor play a big role: code is always a lot slower the first time it is executed, when the caches do not yet contain the code and data used by the program. The memory bus is very slow compared to the processor.
It gets especially meaningless when you measure a printf() statement. The console window is owned by another process, so there's a significant chunk of process-interop overhead whose execution time depends critically on the state of that process. You'll suddenly see a huge difference when the console window needs to be scrolled, for example. And above all, there isn't actually anything you can do to make it faster, so measuring it is only interesting out of curiosity.
Profile only code that you can improve. Take many samples so you can get rid of the outliers. Never pick the lowest measurement; that just creates unrealistic expectations. Don't pick the average either; it is affected too much by the long delays that other processes can impose on your test. The median value is a good choice.
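A minimal sketch of that approach on Windows: take many samples of the code under test (a placeholder comment below), sort them, and report the median. The sample count here is arbitrary:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define SAMPLES 101   /* odd count so the median is a single element */

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    double samples[SAMPLES];
    int i;

    QueryPerformanceFrequency(&freq);
    for (i = 0; i < SAMPLES; i++) {
        QueryPerformanceCounter(&t0);
        /* code under test goes here */
        QueryPerformanceCounter(&t1);
        samples[i] = (t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart; /* ms */
    }

    qsort(samples, SAMPLES, sizeof(double), cmp_double);
    printf("median: %.6f ms\n", samples[SAMPLES / 2]);
    return 0;
}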

Time difference for same code of multithreading on different processors?

Hypothetical Question.
I wrote a multithreaded program that spawns 8 threads, processes the data on those different threads, and completes the job. I am also using a semaphore in the code. But it gives me different execution times on different machines. Which is OBVIOUS!!
Execution time for same code:
On Intel(R) Core(TM) i3 CPU Machine: 36 sec
On AMD FX(tm)-8350 Eight-Core Processor Machine : 32 sec
On Intel(R) Core(TM) i5-2400 CPU Machine : 16.5 sec
So, my question is,
Is there any kind of setting/variable/command/switch I am missing that could be enabled on the faster machine but not on the slower one, and that is making the faster machine's execution time shorter? Or is the time difference due to the processor alone?
Any kind of help/suggestions/comments will be helpful.
Operating System: Linux (Centos5)
Multi-threading benchmarks should be performed with significant statistical sampling (e.g. around 50 experiments per machine). Furthermore, the "environment" in which the program runs matters too (e.g. whether Firefox was running at the same time or not).
Also, depending on resource consumption, run times can vary. In other words, without a more complete picture of your experimental conditions, it's impossible to answer your question.
Some observations I have made from my personal experiments:
Huge memory consumption can alter the results depending on the swapping settings of the machine.
Two "identical" machines with the same OS installed under the same conditions can show different results.
When the total run time is short (compared to, say, 5 minutes), results appear pretty random.
etc.
I used to have a problem with time measurement: the time measured in the multithreaded case was larger than in the single-threaded case. It turned out that the mistake was measuring the time inside each thread and summing those values, instead of measuring the wall-clock time around all of the threads. For example:
Wrong measure:
int main(void)
{
    //create_thread();
    //join_thread();
    //sum the per-thread times
}

void thread(void *arg)
{
    //measure time inside this thread
}
Right measure:
int main(void)
{
    //record start time
    //create_thread();
    //join_thread();
    //record end time
    //calculate the difference
}

void thread(void *arg)
{
    //no per-thread timing needed
}
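A runnable sketch of the right measurement using POSIX threads and clock_gettime; the thread count and the busy-work loop are placeholders:

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 8

/* Placeholder workload: each thread burns some CPU time. */
static void *worker(void *arg)
{
    volatile double x = 0.0;
    for (long i = 0; i < 50 * 1000 * 1000; i++)
        x += i * 0.5;
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    struct timespec start, end;
    int i;

    clock_gettime(CLOCK_MONOTONIC, &start);      /* record start time */
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);        /* record end time */

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("wall-clock time for %d threads: %.3f s\n", NTHREADS, elapsed);
    return 0;
}

Compile with -pthread, and on older glibc versions (such as the one on CentOS 5) add -lrt for clock_gettime.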
