Round-robin processing with MPI (off by one/some) - loops

I have an MPI implementation of IDW2-based gridding on a set of sparsely sampled points. I have divided the jobs up as follows:
All nodes read all the data; the last node does not strictly need to, but whatever.
Node 0 takes each data point and sends it to the worker nodes 1...NNodes-2 with the following code:
int nodes_in_play = NNodes - 2;   /* ranks 1 .. NNodes-2 are workers */
for (int i = 0; i < data_size; i++)
{
    int dest = (i % nodes_in_play) + 1;   /* round-robin over worker ranks */
    //printf("Point %d of %d going to %d\n", i + 1, data_size, dest);
    Error = MPI_Send(las_points[i], 3, MPI_DOUBLE, dest, PIPE_MSG, MPI_COMM_WORLD);
    if (Error != MPI_SUCCESS) break;
}
Nodes 1...NNodes-2 perform IDW-based estimates:
for (int i = 0; i <= data_size - nodes_in_play; i += nodes_in_play)
{
    Error = MPI_Recv(test_point, 3, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    if (status.MPI_TAG == END_MSG) break;
    // ... IDW2 code ...
    Error = MPI_Send(&zdiff, 1, MPI_DOUBLE, NNodes - 1, PIPE_MSG, MPI_COMM_WORLD);
}
Node NNodes-1 receives the estimates and serializes them to the output file.
This works fine for 3 nodes, but with more nodes the IDW loop is off by some amount because of the tricky loop bounds, and the overall run gets stuck: when data_size is not a multiple of nodes_in_play, the number of points a worker is sent no longer matches the number its loop expects to receive. What would be a simple way to run the receive → process → send tasks in the in-between nodes? I am looking for a nifty for-loop line.
What I have done:
Against my better judgement I have added a while(1) loop in the intermediate nodes, with an exit condition when a message with the END_MSG tag is received. Node 0 sends an END_MSG message to all intermediate nodes once all the points have been sent off.

The while loop is an ugly solution, but it works with an end flag, so I will stick with that for now.
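For reference, the sentinel-driven worker loop ends up looking roughly like this (a sketch, reusing test_point, zdiff, PIPE_MSG, and END_MSG from the code above):

/* Worker ranks 1 .. NNodes-2: no per-rank point count is computed,
   so uneven distributions can no longer leave a rank stuck in MPI_Recv. */
while (1)
{
    MPI_Recv(test_point, 3, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
    if (status.MPI_TAG == END_MSG)
        break;                       /* node 0 has sent all its points */
    /* ... IDW2 estimate for test_point ... */
    MPI_Send(&zdiff, 1, MPI_DOUBLE, NNodes - 1, PIPE_MSG, MPI_COMM_WORLD);
}

Rank NNodes-1 needs a matching end condition for its receive loop as well, otherwise it blocks waiting for results that will never arrive.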

Related

Why am I getting a huge slowdown when parallelising with OpenMP and using static scheduling?

I'm working to parallelise a disease-spread model in C using OpenMP, but am only seeing a massive (order of magnitude) slowdown. I'll point out at the outset that I am a complete novice with both OpenMP and C.
The code loops over every point in the simulation and checks its status (susceptible, infected, recovered) and for each status, follows an algorithm to determine its status at the next time step.
I'll give the loop for infected points for illustrative purposes. Lpoints is a list of indices for points in the simulation, Nneigh gives the number of neighbours each point has and Lneigh gives the indices of these neighbours.
for (ipoint = 0; ipoint < Nland; ipoint++) { //loop over all points
    if (Lpoints_old[ipoint] == I) { //act on infected points
        /* Probability Prec of infected population recovering */
        xi = genrand();
        if (xi < Pvac) { /* This point recovers (I->R) */
            Lpoints[ipoint] = R;
            /* printf("Point %d gained immunity\n", ipoint); */
        }
        else {
            /* Probability of being blockaded by neighbours */
            nsn = 0;
            for (in = 0; in < Nneigh[ipoint]; in++) { /*count susceptible neighbours (nsn)*/
                //if (npoint<0) printf("Bad npoint 1: %d in=%d\n",ipoint,in);
                //fflush(stdout);
                npoint = Lneigh[ipoint][in];
                if (Lpoints_old[npoint] == S) nsn++;
            }
            Prob = (double)nsn * Pblo;
            xi = genrand();
            if (xi < Prob) { /* The population point is blockaded (I->R)*/
                Lpoints[ipoint] = R;
            }
            else { /* Still infected */
                Lpoints[ipoint] = I;
            }
        } /*else*/
    } /*infected*/
} /*for*/
I tried to parallelise by adding #pragma omp parallel for default(shared) private(ipoint,xi,in,npoint,nsn,Prob) before the for loop. (I tried using default(none) as is generally recommended, but it wouldn't compile.) On the small grid I am using to test, the original serial code runs in about 5 seconds and the OpenMP version runs in around 50.
I have searched for ages online, and every similar problem seems to be the result of false cache sharing and has been solved by using static scheduling with a chunk size divisible by 8. I tried varying the chunk size to no effect whatsoever, only getting the timings back to the original order when the chunk size surpassed the size of the problem (i.e. back to running linearly on one thread).
The slowdown doesn't seem any better when the problem is more appropriately scaled, as far as I can tell. I have no idea why this isn't working or what's going wrong. Any help greatly appreciated.
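For concreteness, this is how the pragma sits on the loop (a sketch of the attempt above; the caveat about genrand() in the comment is a suspicion, not something verified for this code):

/* Sketch of the parallelisation attempt described above. One caveat:
   if genrand() draws from a single shared generator state (as the
   classic Mersenne Twister reference code does), every thread contends
   on that state, which by itself can produce an order-of-magnitude
   slowdown -- this is an assumption, not verified here. */
#pragma omp parallel for default(shared) private(ipoint,xi,in,npoint,nsn,Prob)
for (ipoint = 0; ipoint < Nland; ipoint++) {
    /* ... loop body exactly as shown above ... */
}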

Small regex NFA matching while possibly larger matches are running

I'm creating my own lexical analyzer generator, similar to (f)lex. My plan was to make a tool like grep and go from there. Right now I've followed the articles by Russ Cox on creating an analyzer as a virtual machine: https://swtch.com/~rsc/regexp/regexp2.html
The issue I've run into is keeping track of small matches while a larger match is running. An example where this could happen is ".*d|ab" on the input string "abcabcabcabcabcabcd".
Currently, I have an NFA running which is practically the same as the one Cox made. It uses Thompson's algorithm to compute DFA state sets on the fly from the NFA. The threads are stored in a sparse set (Preston Briggs and Linda Torczon's 1993 paper, "An Efficient Representation for Sparse Sets").
Threads that appear earlier in this list have higher priority and eliminate later threads upon accepting.
My current implementation can find a word matching a regex, but cannot continuously keep on matching. To show what currently happens, let's look at the example regex and string above.
The first step creates three threads: two for the left side of the union, one for the right side. Only the threads taking the wildcard and 'a' advance into the next list.
Step two again advances the wildcard, and this time the thread taking a 'b'. It also adds a new thread to try to match with this character as the starting point.
Step three advances the wildcard and puts the thread running "ab" into an accepting state. Now the newer (lower-priority) threads are removed, and new threads are added to start matching from this character.
This is where the problem starts: the NFA cannot output a match on "ab" yet; it should wait for ".*d" to finish. This means that for a large input file, a lot of matches would have to be buffered until ".*d" terminates (which is at the last character, to get the largest potential match).
Thus, the actual question is: what is an efficient way to store these matches, or is there another way to avoid potentially buffering the whole input file?
As a side note on the code: the code is so similar to Cox's article that any question about how it functions can be answered by the article. Also, I've downloaded and tested Cox's code, and it has the same issue as described above.
What I hope to get with the solution to this question is an NFA implementation that matches text in exactly the same way grep or lex would: with the longest leftmost matches.
As requested, code fragments:
VM opcodes representing ".*d|ab":
0 SPLIT 2, 5
1 ANY
2 SPLIT 1, 3
3 CHAR 'd'
4 JMP 7
5 CHAR 'a'
6 CHAR 'b'
7 ACCEPT
CHAR asserts input character to match first operand.
SPLIT creates two new threads starting at the first and second operand.
ANY matches any input character.
JMP sets the program counter to the first operand.
ACCEPT puts the current thread in an accepting state.
An add function that recursively follows opcodes until reaching a CHAR or ACCEPT opcode, then adds the thread to the next thread list:
static void recursive_add(struct NFA *nfa, struct Thread *thread)
{
    switch (nfa->code[thread->pc])
    {
    case OPCODE_JMP:
        printf(" | jmp\n");
        thread->pc = nfa->code[thread->pc + 1];
        recursive_add(nfa, thread);
        break;
    case OPCODE_SPLIT:
    {
        uint16_t pc = thread->pc;
        printf(" | split: %d op1: %d op2: %d\n", thread->pc, nfa->code[thread->pc + 1], nfa->code[thread->pc + 2]);
        thread->pc = nfa->code[thread->pc + 1];
        recursive_add(nfa, thread);
        thread->pc = nfa->code[pc + 2];
        recursive_add(nfa, thread);
        break;
    }
    case OPCODE_CHAR: case OPCODE_ACCEPT:
        printf(" | char/accept %d\n", thread->pc);
        add_sparseset(nfa->nthreads, thread);
        return;
    default:
        fprintf(stderr, "Unexpected opcode %x at program counter %d\n", nfa->code[thread->pc], thread->pc);
        exit(1);
    }
}
And the main loop, currently a bit messy but this should give a better idea of what the code does.
void execute_nfa(struct NFA *nfa)
{
    struct Thread thread;
    struct SparseSet *swap;
    int c;                          /* int, not char, so EOF is representable */
    for (;;)
    {
        c = getchar();
        if (c == EOF) break;
        printf("Input char %c\n", c);
        //add new starting thread for every character
        thread.pc = 0;
        recursive_add(nfa, &thread);
        clear_sparseset(nfa->cthreads);
        swap = nfa->cthreads;
        nfa->cthreads = nfa->nthreads;
        nfa->nthreads = swap;
        for (unsigned int i = 0; i < nfa->cthreads->size; i++)
        {
            thread = *(((struct Thread *)nfa->cthreads->dense) + i);
            printf(" thread pc: %d\n", thread.pc);
            switch (nfa->code[thread.pc])
            {
            case OPCODE_CHAR:
                if (nfa->code[thread.pc + 1] == c)
                {
                    printf(" Add to next\n");
                    thread.pc += 2;
                    recursive_add(nfa, &thread);
                }
                break;
            case OPCODE_ACCEPT:
                printf(" accept (still need to do shit)\n");
                break;
            default:
                fprintf(stderr, "Unexpected opcode %x at program counter %d\n", nfa->code[thread.pc], thread.pc);
                exit(1);
            }
        }
    }
}
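One standard way out, sketched below with hypothetical helper names, is how lex-style "maximal munch" is usually implemented: you never need to buffer more than one candidate match. Track the end position of the best accept seen since the current starting point; when the thread list runs empty (or input ends), report that single match and restart the machine right after it.

/* Sketch of leftmost-longest ("maximal munch") reporting.
   Helper names (restart_threads, step_threads, ...) are hypothetical.
   Key point: at most ONE pending match is stored, because a newer
   accept simply overwrites the previous candidate. */
int start = 0;        /* where the current match attempt began        */
int best_end = -1;    /* end of the longest accept so far; -1 = none  */
int pos = 0;

restart_threads(nfa, start);            /* seed a thread at opcode 0  */
while (input[pos] != '\0') {
    step_threads(nfa, input[pos]);      /* advance all live threads one char */
    if (accepting_thread_exists(nfa))
        best_end = pos;                 /* longer match: overwrite candidate */
    pos++;
    if (no_threads_left(nfa)) {         /* nothing can grow any further */
        if (best_end >= 0) {
            report_match(start, best_end);
            start = best_end + 1;       /* resume right after the match */
        } else {
            start++;                    /* no match started here; slide */
        }
        best_end = -1;
        pos = start;
        restart_threads(nfa, start);
    }
}
if (best_end >= 0)
    report_match(start, best_end);      /* flush the pending match at EOF */

In the ".*d|ab" example this still means nothing can be reported until the ".*d" thread dies or the input ends; that delay is inherent to leftmost-longest semantics. What you avoid is buffering many matches: only the single best (start, best_end) pair is stored at any time.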

Tasks run in threads take longer than in serial?

So I'm doing some computation on 4 million nodes.
The very basic serial version just has a for loop which loops 4 million times and does 4 million computations. This takes roughly 1.2 sec.
When I split the for loop into, say, 4 for loops, each doing 1/4 of the computation, the total time becomes 1.9 sec.
I guess there is some overhead in creating for loops, and maybe it has to do with the CPU liking to compute data in chunks.
The thing that really bothers me is that when I put the 4 loops on 4 threads on an 8-core machine, each thread takes 0.9 seconds to finish.
I was expecting each of them to take only 1.9/4 seconds instead.
I don't think there is any race condition or synchronisation issue, since all I do is have one for loop to create the 4 threads (which takes 200 microseconds) and another for loop to join them.
The computation reads from a shared array and writes to a different shared array.
I am sure they are not writing to the same byte.
Where could the overhead come from?
main — ncores: number of cores; node_size: size of the graph (4 million nodes):
for (i = 0; i < ncores; i++) {
    int *t = (int *)malloc(sizeof(int));
    *t = i;
    int iret = pthread_create(&thread[i], NULL, calculate_rank_p, (void *)t);
}
for (i = 0; i < ncores; i++) {
    pthread_join(thread[i], NULL);
}
calculate_rank_p: vector is the rank vector for page rank calculation
void *calculate_rank_p(void *argument) {
    int index = *(int *)argument;
    for (i = index; i < node_size; i += ncores)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}
calc_r: this is just a page rank calculation using compressed row format.
double calc_r(int i, double *vector) {
    double prank = 0;
    int j;
    for (j = row_ptr[i]; j < row_ptr[i + 1]; j++) {
        prank += vector[col_ind[j]] * val[j];
    }
    return prank;
}
Everything that is not declared here is a global variable.
The computation reads from a shared array and writes to a different shared array. I am sure they are not writing to the same byte.
It's impossible to be sure without seeing relevant code and having some more details, but this sounds like it could be due to false sharing, or ...
the performance issue of false sharing (aka cache line ping-ponging), where threads use different objects but those objects happen to be close enough in memory that they fall on the same cache line, and the cache system treats them as a single lump that is effectively protected by a hardware write lock that only one core can hold at a time. This causes real but invisible performance contention; whichever thread currently has exclusive ownership so that it can physically perform an update to the cache line will silently throttle other threads that are trying to use different (but, alas, nearby) data that sits on the same line.
http://www.drdobbs.com/parallel/eliminate-false-sharing/217500206
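To make the quoted effect concrete, here is a toy sketch (not the poster's code; the 64-byte line size is an assumption):

/* Two "independent" counters that share one cache line: threads
   updating a and b separately will still ping-pong the line. */
struct shared {
    long a;                        /* written by thread 1                  */
    long b;                        /* written by thread 2, same line as a! */
};

/* Padding pushes b onto its own cache line and removes the contention. */
struct padded {
    long a;
    char pad[64 - sizeof(long)];   /* assuming 64-byte cache lines */
    long b;
};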
UPDATE
This looks like it could very well trigger false sharing, depending on the size of a vector element (though there is still not enough information in the post to be sure, as we don't see how the various vectors are allocated):
for(i = index; i < node_size ; i+=ncores)
With i += ncores the cores interleave over the data; instead of interleaving which core works on which element, give each of them a contiguous range of data to work on.
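A minimal sketch of that change, reusing node_size, ncores, vector, current_vector, and calc_r from the question:

/* Each thread works on one contiguous block instead of every
   ncores-th element, so neighbouring writes by different threads
   no longer land in the same cache lines. */
void *calculate_rank_p(void *argument) {
    int index = *(int *)argument;
    int chunk = (node_size + ncores - 1) / ncores;   /* ceiling division */
    int begin = index * chunk;
    int end = begin + chunk > node_size ? node_size : begin + chunk;
    for (int i = begin; i < end; i++)
        current_vector[i] = calc_r(i, vector);
    return NULL;
}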
I saw the same surprise when building and running in Debug (with other test code, though).
In Release, all was as expected ;)

How do I create a "twirly" in a C program task?

Hey guys, I have created a program in C that tests all numbers between 1 and 10000 to check whether they are perfect, using a function that determines whether a number is perfect. Once it finds them it prints them to the user; they are 6, 28, 496 and 8128. After this the program prints out all the factors of each perfect number. This is all fine. Here is my problem.
The final part of my task asks me to:
"Use a "twirly" to indicate that your program is happily working away. A "twirly" is the following characters printed over the top of each other in the following order: '|' '/' '-' '\'. This has the effect of producing a spinning wheel - ie a "twirly". Hint: to do this you can use \r (instead of \n) in printf to give a carriage return only (instead of a carriage return linefeed). (Note: this may not work on some systems - you do not have to do it this way.)"
I have no idea what a twirly is or how to implement one. My tutor said it has something to do with the sleep and delay functions, which I also don't know how to use. Can anyone help me with this last stage? It sucks that all my coding is complete but I can't get this "twirly" thing to work.
If you want to simultaneously perform the tasks of testing the numbers and displaying the twirly on screen while the process goes on, then you had better look into using threads. Using POSIX threads, you can run the test on one thread while the other thread displays the twirly to the user on the terminal.
#include <stdlib.h>
#include <pthread.h>

int Test();
void Display();

int main() {
    // create a thread each for both tasks, Test and Display
    // start the threads
    // wait for the Test thread to finish
    // terminate the Display thread once the Test thread completes
    // exit code
}
Refer to chapter 12 (threads) of the Beginning Linux Programming ebook.
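Filling in that skeleton, a minimal runnable sketch (the names Test/Display and the 0.1 s frame delay are choices for illustration, not part of the assignment):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

static volatile int done = 0;        /* set by the main thread when the test finishes */

static void *Display(void *arg)
{
    const char *twirly = "|/-\\";
    int p = 0;
    while (!done) {
        printf("%c\r", twirly[p++ % 4]);
        fflush(stdout);              /* \r alone won't flush the terminal */
        usleep(100000);              /* 0.1 s per frame */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, Display, NULL);
    /* ... run the perfect-number test here ... */
    done = 1;                        /* tell the spinner to stop */
    pthread_join(tid, NULL);
    return 0;
}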
Given the program upon which the user is "waiting", I believe the problem as stated and the solutions using sleep() or threads are misguided.
To produce all the perfect numbers below 10,000 using C on a modern personal computer takes about 1/10 of a second. So any device to show the computer is "happily working away" would either never be seen or would significantly interfere with the time it takes to get the job done.
But let's make a working twirly for the perfect number search anyway. I've left out printing the factors to keep this simple. Since 10,000 is too low to see the twirly in action, I've upped the limit to 100,000:
#include <stdio.h>
#include <string.h>

int main()
{
    const char *twirly = "|/-\\";

    for (unsigned x = 1; x <= 100000; x++)
    {
        unsigned sum = 0;

        for (unsigned i = 1; i <= x / 2; i++)
        {
            if (x % i == 0)
            {
                sum += i;
            }
        }

        if (sum == x)
        {
            printf("%u\n", x);
        }

        printf("%c\r", twirly[x / 2500 % strlen(twirly)]);
        fflush(stdout);   /* make the \r-overwritten frame actually appear */
    }

    return 0;
}
No need for sleep() or threads, just key it into the complexity of the problem itself and have it update at reasonable intervals.
Now here's the catch: although the above works, the user will never see a fifth perfect number pop out with a 100,000 limit, and even with a 100,000,000 limit, which should produce one more (33,550,336), they'll likely give up, as this is a bad (slow) algorithm for finding them. But they'll have a twirly to watch.
i as integer
p as integer, set p = 0
loop i: 1 to 10000
    sum as integer
    set sum = 0
    loop j: 1 to i/2
        if i % j == 0
            sum += j
    if sum == i
        print i as a perfect number
    if i % 100 == 0
        str as character pointer
        set str = "|/-\\"
        len as integer, set len = 4
        print str[p] using "%c\r" as format specifier
        increment p and assign its modulo by len to p

Bizarre memory issue?

I'm having an interesting problem, which I hope is entirely my fault.
I have code which is reading from a queue, as in:
do {
    evt = &newevts[evt_head++];
    evt_head &= MAX_EVENTS;          /* wrap: MAX_EVENTS is a power-of-2 mask */
    if (evt->index <= 0 || evt->index > MAX_INDEX) {
        printf("RX EVENT BAD NDX: ndx=%d h=%d\n", evt->index, evt_head);
        continue;
    }
    //... etc ...
} while (evt_head != evt_tail);
The bizarre issue is that the if statement can see evt->index as a bad value, yet when the printf displays it, it shows a perfectly valid value! Example:
RX EVENT BAD NDX: ndx=1 h=64
The if statement clearly shows the condition must be <= 0 or > 1024 (the max index). To make matters worse, this only occurs once in a while. I'm using GCC on CentOS 6.3. No threads touch evt_head except this one. (I've renamed it a few times and recompiled just to be sure.)
The tail is handled by a function which adds items to the queue in the same manner the head removes them (increment, then AND). I have also added a counter inside the event structure itself to record the head/tail values as events are placed into the queue, and I find no lost or skipped values. It literally looks as though I'm getting bad memory reads. But that's ridiculous; I'd expect system crashes, or at least program crashes, if that were the case.
Any ideas on how in the world this could be happening sporadically? (The frequency is about 1 in 100 reads.) I appreciate any input!
typedef struct {
    int index;
    int event;
} EVENT;

#define MAX_EVENTS 0x01ff   /* 512-slot ring; used as an AND mask */
#define MAX_INDEX  1024
No threads or other code touch evt_head; only this loop does. The queue is never anywhere near full. I also happen to have a "spin lock" on entry to the routine which adds to the queue (in preparation for it being accessed by other threads later), and an unlock on exit.
My guess is that the function adding events at the tail changes evt_tail before it writes the index field. This allows your reader to access an event that is still in the process of being written.
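If that is the case, a sketch of the writer-side fix (enqueue_event and its parameters are hypothetical names): completely fill in the slot, then issue a barrier, and only then advance evt_tail, so the reader can never observe a half-written event.

/* Hypothetical writer: publish the event, THEN move the tail. */
void enqueue_event(int index, int event)
{
    EVENT *evt = &newevts[evt_tail];
    evt->index = index;                       /* 1. fill the slot completely */
    evt->event = event;
    __sync_synchronize();                     /* 2. GCC full barrier: slot visible before tail */
    evt_tail = (evt_tail + 1) & MAX_EVENTS;   /* 3. reader may now consume it */
}

evt_tail should also be volatile (or, with a newer compiler, a C11 atomic) so the reader's while condition actually re-reads it from memory.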
