OpenMPI 1.4.3 mpirun hostfile error - c

I am trying to run a simple MPI program on 4 nodes. I am using OpenMPI 1.4.3 running on CentOS 5.5. When I submit the mpirun command with the hostfile/machinefile, I get no output, just a blank screen, so I have to kill the job.
I use the following run command: mpirun --hostfile hostfile -np 4 new46
OUTPUT ON KILLING JOB:
mpirun: killing job...
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process that caused
that situation.
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--------------------------------------------------------------------------
myocyte46 - daemon did not report back when launched
myocyte47 - daemon did not report back when launched
myocyte49 - daemon did not report back when launched
Here is the MPI program I am trying to execute on 4 nodes
**************************
if (my_rank != 0)
{
    sprintf(message, "Greetings from the process %d!", my_rank);
    dest = 0;
    MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
}
else
{
    for (source = 1; source < p; source++)
    {
        MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
        printf("%s\n", message);
    }
}
****************************
My hostfile looks like this:
[amohan@myocyte48 ~]$ cat hostfile
myocyte46
myocyte47
myocyte48
myocyte49
*******************************
I ran the above MPI program independently on each of the nodes and it compiled and ran just fine. I have this issue of "Daemon did not report back when launched" when I use the hostfile. I am trying to figure out what could be the issue.
Thanks!

I think these lines
myocyte46 - daemon did not report back when launched
are pretty clear -- you're having trouble either launching the MPI daemons or communicating with them afterwards. So you need to start looking at networking. Can you ssh without a password into these nodes? Can you ssh back? Leaving aside the MPI program, can you
mpirun -np 4 hostname
and get anything?
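For example, a quick sanity check along these lines (hostnames taken from the hostfile above) exercises both the ssh path and the launcher without involving any MPI code:
# from the node you submit from (myocyte48), each of these should print a
# hostname without asking for a password
ssh myocyte46 hostname
ssh myocyte47 hostname
ssh myocyte49 hostname
# then test the launcher itself with a non-MPI program
mpirun --hostfile hostfile -np 4 hostname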

Related

Core dump when running C program using systemd

I have a program in C that runs well when run directly from the command line but fails when run with systemd:
Core was generated by `/usr/local/bin/midnite-modbusd'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0 0x0000000000401308 in main (argc=1, argv=0x7ffeae390268) at src/midnite-modbusd.c:139
139 slen= interval - (millis % interval);
The code in question:
//wait for start of each sample interval
gettimeofday(&tv,NULL);
millis= (long long unsigned)tv.tv_sec*1000 + (tv.tv_usec/1000);
slen= interval - (millis % interval);
i= (millis+slen) % 1000;
usleep (slen*1000);
The full code is available on github.
The systemd unit:
[Unit]
Description=Midnite Classic modbus data polling
After=network.target
[Service]
Type=simple
User=midnite-modbusd
ExecStart=/usr/local/bin/midnite-modbusd
Restart=on-failure
[Install]
WantedBy=multi-user.target
What can be so different when a program runs under systemd?
Edit 1
It seems that my program has major issues that only happen when running with systemd:
it won't read my configuration file, which should throw an error message and exit(1) because of invalid values
journalctl doesn't get filled in real time. Using journalctl -f I have to wait a couple of minutes before a bunch of logs suddenly appears
As a side note, for my command-line tests I run: sudo -H -u midnite-modbusd /usr/local/bin/midnite-modbusd
The interval is only initialized if sample_interval is defined in the configuration file, so check that the file is correct and that sample_interval is present. An uninitialized interval would cause the divide-by-zero exception in millis % interval.
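A minimal defensive sketch (variable names follow the snippet above; the assumption is that interval comes from the parsed sample_interval value):
/* sketch only: guard the division that crashes at midnite-modbusd.c:139 */
if (interval <= 0) {
    fprintf(stderr, "sample_interval missing or invalid (%d); exiting\n", interval);
    exit(EXIT_FAILURE);
}
slen = interval - (millis % interval);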
I found the issue in this code:
if (getppid() == 1) {
    sprintf(str, "Daemon aready running");
    log_message(log_file_path, (char*)str);
    return;
}
This code is there for when the program forks itself to run as an "old style" daemon.
I didn't realize that, since systemd launches it, the program's parent is PID 1 (so getppid() returns 1 when running under systemd but not from the command line).
Anyway it is badly written: this test should stop the program, not just return.
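A hedged sketch of what "stop the program" could look like here (note that under systemd this branch will always be taken, since the parent is PID 1, so in practice the check probably needs to be removed or made conditional):
if (getppid() == 1) {
    sprintf(str, "Daemon already running");
    log_message(log_file_path, (char*)str);
    exit(EXIT_FAILURE);   /* fail loudly instead of silently returning */
}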

MPI programs hanging up

I installed mpich2 on my Ubuntu 14.04 laptop with the following command:
sudo apt-get install libcr-dev mpich2 mpich2-doc
This is the code I'm trying to execute:
#include <mpi.h>
#include <stdio.h>
int main()
{
    int myrank, size;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world! I am %d of %d\n", myrank, size);
    MPI_Finalize();
    return 0;
}
Compiling it with mpicc helloworld.c gives no errors. But when I execute the program with mpirun -np 5 ./a.out, there is no output; the program just keeps running as if it were in an infinite loop. On pressing Ctrl+C, this is what I get:
$ mpirun -np 5 ./a.out
^C[mpiexec@user] Sending Ctrl-C to processes as requested
[mpiexec@user] Press Ctrl-C again to force abort
[mpiexec@user] HYDU_sock_write (./utils/sock/sock.c:291): write error (Bad file descriptor)
[mpiexec@user] HYD_pmcd_pmiserv_send_signal (./pm/pmiserv/pmiserv_cb.c:170): unable to write data to proxy
[mpiexec@user] ui_cmd_cb (./pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec@user] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@user] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@user] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
I couldn't find a solution by googling. What is causing this error?
I was getting the same issue with two compute nodes:
$ mpirun -np 10 -ppn 5 --hosts c1,c2 ./a.out
[mpiexec@c1] Press Ctrl-C again to force abort
[mpiexec@c1] HYDU_sock_write (utils/sock/sock.c:286): write error (Bad file descriptor)
[mpiexec@c1] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:169): unable to write data to proxy
[mpiexec@c1] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:79): unable to send signal downstream
[mpiexec@c1] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@c1] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@c1] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
It turns out the c1 node couldn't ssh to c2.
If you are using only a single machine, you can try using fork as the launcher:
mpirun -launcher fork -np 5 ./a.out
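In the two-node case, a quick check using the hostnames from the example above is:
# should print c2's hostname without prompting for a password
ssh c2 hostname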

How to configure GDB in Eclipse such that all processes keep running, including the process being debugged?

I am new to C programming and I have been trying hard to customize an open-source tool written in C to my organization's needs.
IDE: Eclipse,
Debugger: GDB,
OS: RHEL
The tool is multi-process in nature (the main process starts first and spawns several child processes using fork()) and the processes share values at run time.
While debugging in Eclipse (using GDB), I find that only the process being debugged is running while the other processes are suspended. Thus, the running process cannot do its intended job because the other processes are suspended.
I saw somewhere that using the MI command "set non-stop on" in GDB could keep the other processes running. I used that command in the gdbinit file shown below:
Note: I have overridden the default .gdbinit file with another gdbinit because the default one does not let me debug child processes, as the debugger terminates after the execution of the main process.
But unfortunately the debugger stops responding after using this command.
Please see below the commands I am using in the gdbinit file:
Commenting out the non-stop command lets Eclipse continue its usual debugging of the current process.
Adding: you can see in the image below that only one process is running while the others are suspended.
Can anyone please help me to configure GDB according to my requirement?
Thanks in advance.
OK @n.m.: Actually, you were right. I should have spent more time understanding the flow of the code.
The tool creates 3 processes first, and then the third process creates 5 threads and keeps calling wait() for any child to terminate.
The top 5 entries (highlighted in blue) in the image below are threads, and they are children of process ID 17991.
The first two processes are intended to initiate the basic functionality of the tool and hence they just exit(0), as you can see below.
if (0 != (pid = zbx_fork()))
    exit(0);    /* parent exits immediately */
setsid();
signal(SIGHUP, SIG_IGN);
if (0 != (pid = zbx_fork()))
    exit(0);    /* first child exits; the grandchild carries on as the daemon */
That was the reason I was not actually able to step into these 3 processes. Whenever I tried to do so, the whole main process terminated immediately, which in turn terminated all the other processes.
So I learned that I was supposed to "step into" the threads only. And yes, now I can actually debug :)
This worked because I removed the MI command "set follow-fork-mode child" and just used the default .gdbinit file with "Automatically debug forked process" enabled.
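For reference, the GDB settings discussed in this thread would look roughly like this in a gdbinit file (an illustrative sketch, not the exact file from the question):
# keep the other processes/threads running while one is stopped
set non-stop on
# follow the child after fork() instead of the parent
# (this is the line that was removed in the end, relying on Eclipse's
#  "Automatically debug forked process" option instead)
set follow-fork-mode child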
Thanks everyone for your input. Stackoverflow is an awesome place to learn and share. :)

error while trying to run MPI program with username

When I run the program via:
myshell$] mpirun --hosts localhost,192.168.1.4 ./a.out
the program executes successfully. Now when I try to run:
myshell$] mpirun --hosts localhost,myac@192.168.1.4 ./a.out
openssh prompts for a password, and then I get:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(433)..............:
MPID_Init(176).....................: channel initialization failed
MPIDI_CH3_Init(70).................:
MPID_nem_init(286).................:
MPID_nem_tcp_init(108).............:
MPID_nem_tcp_get_business_card(354):
MPID_nem_tcp_init(313).............: gethostbyname failed, myac@192.168.1.4 (errno 1)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@myac] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@myac] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@myac] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@myac] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@myac] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@myac] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@myac] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Why am I getting an error when I provide the username?
You could try specifying a username in your ssh config file (http://www.cyberciti.biz/faq/create-ssh-config-file-on-linux-unix/) instead of on the mpirun command line. That way mpirun would not be confused by the extra username part, which as far as I can see from the documentation it does not support, while ssh could still, behind the scenes, use the username from your config file. And of course you'll want to set up SSH keys so you don't have to type a password.
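For example, an entry along these lines in ~/.ssh/config on the machine you launch from (host and username taken from the question) would let ssh supply the username itself:
Host 192.168.1.4
    User myac
With that in place, the plain mpirun --hosts localhost,192.168.1.4 ./a.out form should work while ssh logs in as myac behind the scenes.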
I don't believe MPICH supports providing usernames in the --hosts value on the command line. You should try the host file based method described on the wiki. http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Using_Hydra_on_Machines_with_Different_User_Names
For example:
shell$ cat hosts
donner user=foo
foo user=bar
shakey user=bar
terra user=foo
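That host file is then passed to Hydra with -f, for example (the host and user names above are the wiki's placeholders):
mpiexec -f hosts -n 4 ./a.out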

MPICH: How to publish_name such that a client application can lookup_name it?

While learning MPI using MPICH on Windows (1.4.1p1) I found some sample code here. Originally, when I ran the server, I would have to copy the generated port_name and start the client with it so the client could connect to the server. I modified it to use MPI_Publish_name() in the server instead. After launching the server with a name of aaaa, I launch the client, which fails in MPI_Lookup_name() with:
Invalid service name (see MPI_Publish_name), error stack:
MPID_NS_Lookup(87): Lookup failed for service name aaaa
Here are the snipped bits of code:
server.c
MPI_Comm client;
MPI_Status status;
char port_name[MPI_MAX_PORT_NAME];
char serv_name[256];
double buf[MAX_DATA];
int size, again;
int res = 0;
MPI_Init( &argc, &argv );
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Open_port(MPI_INFO_NULL, port_name);
sprintf(serv_name, "aaaa");
MPI_Publish_name(serv_name, MPI_INFO_NULL, port_name);
while (1)
{
    MPI_Comm_accept( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &client );
    /*...snip...*/
}
client.c
MPI_Comm server;
double buf[MAX_DATA];
char port_name[MPI_MAX_PORT_NAME];
memset(port_name,'\0',MPI_MAX_PORT_NAME);
char serv_name[256];
memset(serv_name,'\0',256);
strcpy(serv_name, argv[1]);
MPI_Lookup_name(serv_name, MPI_INFO_NULL, port_name);
MPI_Comm_connect( port_name, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server );
MPI_Send( buf, 0, MPI_DOUBLE, 0, tag, server );
MPI_Comm_disconnect( &server );
MPI_Finalize();
return 0;
I cannot find any information about altering the visibility of published names, if that is even the problem. MPICH does not seem to have implemented anything with MPI_INFO. I would try OpenMPI but I am having trouble just building it. Any suggestions?
I uploaded a working version using OpenMPI 1.6.5 of a client and server in C on Ubuntu that uses the ompi-server name server here:
OpenMPI nameserver client server example in C
(digging up old stuff)
For MPICH, the code by @daemondave should actually work as well.
It does, however, still require getting a nameserver running.
For MPICH, instead of using ompi-server, this can be done using hydra_nameserver.
The host then has to be specified for all the mpirun/mpiexec calls using -nameserver HOSTNAME.
I created a working example over at github, which also provides a shell script to build+run the example.
P.S: the ompi-server variant seems to be somewhat outdated (and includes a few bugs).
For an updated, but still, somewhat undocumented alternative, see this comment.
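Roughly, the MPICH flow looks like this (a sketch; nshost stands in for whatever host runs the name server):
# start the name server once, on a host reachable from all nodes
hydra_nameserver &
# point both the server and the client at that name server
mpiexec -nameserver nshost -n 1 ./server
mpiexec -nameserver nshost -n 1 ./client aaaa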
This approach of publishing names, looking them up, and connecting to them is outlandish relative to normal MPI usage.
The standard pattern is to use mpirun to specify a set of nodes on which to launch a given number of processes. The operation of common mpirun implementations is explained in another question.
Once the processes are all launched as part of a single parallel job, the MPI library reads whatever information the launcher provided during MPI_Init to set up MPI_COMM_WORLD, a communicator over the group of all processes in the job.
Using that communicator, the parallel application can distribute work, exchange information, and so forth. It would do this using the common MPI_Send and MPI_Recv routines, in all their variants, the collective operations, and so forth.
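As a minimal illustration of that single-job pattern (a sketch, not tied to any of the programs above), here is a collective sum over MPI_COMM_WORLD:
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sum;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every rank contributes its rank number; rank 0 receives the total */
    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks 0..%d = %d\n", size - 1, sum);

    MPI_Finalize();
    return 0;
}
Launched with, say, mpirun -np 4 ./a.out, all four processes are created by the launcher and already share MPI_COMM_WORLD; no publishing, lookup, or connecting is involved.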
