error while trying to run MPI program with username - c

When I run program via:
myshell$] mpirun --hosts localhost,192.168.1.4 ./a.out
the program executes successfully. Now when I try to run:
myshell$] mpirun --hosts localhost,myac#192.168.1.4 ./a.out
openssh prompts for password. I get:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(433)..............:
MPID_Init(176).....................: channel initialization failed
MPIDI_CH3_Init(70).................:
MPID_nem_init(286).................:
MPID_nem_tcp_init(108).............:
MPID_nem_tcp_get_business_card(354):
MPID_nem_tcp_init(313).............: gethostbyname failed, myac#192.168.1.4 (errno 1)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0#myac] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0#myac] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#myac] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec#myac] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec#myac] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec#myac] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec#myac] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Why am I getting error when I am providing the username?

You could try specifying a username in your ssh config file (http://www.cyberciti.biz/faq/create-ssh-config-file-on-linux-unix/) instead of on the mpirun command line. That way perhaps mpirun would not be confused by the extra username part, which as far as I can see from the documentation it does not support. But ssh could, behind the scenes, use the username you specify in your ssh config file. And of course you'll want to set up SSH keys so you don't have to type a password.

I don't believe MPICH supports providing usernames in the --hosts value on the command line. You should try the host file based method described on the wiki. http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Using_Hydra_on_Machines_with_Different_User_Names
For example:
shell$ cat hosts
donner user=foo
foo user=bar
shakey user=bar
terra user=foo

Related

Cause: Command execution failed on the local server with non-zero exit code

Failed to fetch information from target servers
Cause: Command execution failed on the local server with non-zero exit code.
command: /usr/local/psa/bin/ipmanage --xml-info
exit code: 255
stdout: <ipinfo>
<ip name="193.160.214.57">
<state>0</state>
<type>shared</type>
<ip_address>193.160.214.57</ip_address>
<mask>255.255.255.255</mask>
<iface>venet0</iface>
<clients>0</clients>
<hostings>0</hostings>
<ftps>false</ftps>
<publicIp></publicIp>
</ip>
</ipinfo>
stderr: [2019-10-20 21:21:51.133] ERR [util_exec] proc_close() failed ['/usr/local/psa/admin/bin/f2bmng' '--reload'] with exit code [1]
PHP Fatal error: Uncaught PleskUtilException: f2bmng failed: 2019-10-20 21:21:51,115 fail2ban.jailreader [17670]: ERROR No file(s) found for glob /var/log/secure
2019-10-20 21:21:51,115 fail2ban [17670]: ERROR Failed during configuration: Have not found any log file for ssh jail
ERROR:__main__:Command '['/usr/bin/fail2ban-client', 'reload']' returned non-zero exit status 255 in /usr/local/psa/admin/plib/Service/Agent.php:210
Stack trace:
#0 /usr/local/psa/admin/plib/Ip/Ban/Manager.php(490): Service_Agent->execAndGetResponse('f2bmng', Array, '')
#1 /usr/local/psa/admin/plib/Ip/Ban/Manager.php(458): Ip_Ban_Manager->_callUtility('--reload')
#2 /usr/local/psa/admin/plib/Fail2Ban/EventListener.php(123): Ip_Ban_Manager->reload()
#3 [internal function]: Plesk\Fail2Ban\EventListener->applyChanges()
#4 {main}
thrown in /usr/local/psa/admin/plib/Service/Agent.php on line 210
That is a critical error, migration was stopped.
I don't know what is "wrong" with your plesk (not so familiar with), but fail2ban error is pretty simply:
ERROR No file(s) found for glob /var/log/secure
2019-10-20 21:21:51,115 fail2ban [17670]: ERROR Failed during configuration: Have not found any log file for ssh jail
Your ssh jail seems to be configured to monitor /var/log/secure which is not exist. Either you have to specify proper logpath (/var/log/auth.log?) where ssh logs authentication errors;
or if it is systemd journal on your system, you have to specify backend = systemd for that.
Related fail2ban jail.local would be:
[ssh]
# backend = systemd
logpath = /var/log/auth.log
But you can surely configure this in plesk settings too.
Also note your jail is called ssh, where normally original default jail of fail2ban is sshd (but it could be indeed configured with this name from your maintainer).

How to handle exit codes with Camel Exec?

I'm using Camel Exec for automated shutdowns on some of our devices.
The shutdown command is pretty simple, and it mostly works fine:
from(START_DEEP_SLEEP)
.setBody(constant(null)) // we don't want stdin for exec
.setHeader(ExecBinding.EXEC_COMMAND_ARGS, constant("""shutdown $shutdownDelay "starting deep sleep shutdown" """))
.to("exec:sudo")
Obviously, this command will send a shutdown to the application executing it. That too isn't much of an issue, except that sometimes this produces an exit value of 143. I know the meaning of the return value, and it makes sense to see it here, but this only happens on some devices. Most others just return 0. They are all of the same type, so I really don't know where this discrepancy comes from, but it's not even that big an issue. The shutdown works none the less.
The problem is that camel exec logs this as an error:
ERROR 549 --- [Camel (camel-1) thread #1 - seda://start-deepsleep] o.a.camel.component.exec.ExecProducer : The command ExecCommand [args=[shutdown, now, starting deep sleep shutdown], executable=sudo, timeout=9223372036854775807, outFile=null, workingDir=null, useStderrOnEmptyStdout=false] returned exit value 143
This produces undesired noise in our monitoring, and I would rather not have it logged.
The core issue here is that Camel Exec does not throw, so there's no exception I could handle. It just logs the error, which then gets picked up by our log analysis.
I would like to handle that exit code gracefully without camel Exec logging an error. The return value is already logged separately anyways. How can I do that?
According to the docu http://camel.apache.org/exec.html there is a header ExecBinding.EXEC_EXIT_VALUE filled with the error number. Yours should be 143 (the docu states that this depends on the OS).
That could be a "hook" to handle the log entry, e.g. deleting the last entry with the same error number.
Of course this is only a cosmetic fix. The implementation could be like this:
from(START_DEEP_SLEEP)
.setBody(constant(null)) // we don't want stdin for exec
.setHeader(ExecBinding.EXEC_COMMAND_ARGS, constant("""shutdown $shutdownDelay "starting deep sleep shutdown" """))
.to("exec:sudo")
.when(header(ExecBinding.EXEC_EXIT_VALUE))
.to("direct:edit_the_log")
Please note that I did not test that code. Maybe u access that header with
.when(header(EXEC_EXIT_VALUE))
instead.
Please, inform me if that could be a proper solution or not.

Core dump when running C program using systemd

I have a program in C that runs well when running it directly from the comand line but fails when running it with systemd:
Core was generated by `/usr/local/bin/midnite-modbusd'.
Program terminated with signal SIGFPE, Arithmetic exception.
#0 0x0000000000401308 in main (argc=1, argv=0x7ffeae390268) at src/midnite-modbusd.c:139
139 slen= interval - (millis % interval);
The code in question:
//wait for start of each sample interval
gettimeofday(&tv,NULL);
millis= (long long unsigned)tv.tv_sec*1000 + (tv.tv_usec/1000);
slen= interval - (millis % interval);
i= (millis+slen) % 1000;
usleep (slen*1000);
The full code is available on github.
The systyemd unit:
[Unit]
Description=Midnite Classic modbus data polling
After=network.target
[Service]
Type=simple
User=midnite-modbusd
ExecStart=/usr/local/bin/midnite-modbusd
Restart=on-failure
[Install]
WantedBy=multi-user.target
What can be so different when a program runs with systemd ?
Edit 1
It seems that my program has major issues that only happen when running with systemd:
it won't read my configuration file, which should throw an error message and exit(1) because of invalid values
journactl doesn't get filled in real time. Using journactl -f I have to wait a couple of minutes before seeing a bunch of logs that appear suddenly
As a side note for my tests using the command line I run: sudo -H -u midnite-modbusd /usr/local/bin/midnite-modbusd
A defined value of sample_interval from configuration file will initialize the interval, please check if the file is correct and sample_interval is present. An uninitialized value of interval might cause the divide by zero exception
I found the issue in this code:
if (getppid()==1) {
sprintf(str, "Daemon aready running");
log_message(log_file_path,(char*)str);
return;
}
This code is here for when the program was intended to fork itself to run as an "old style" daemon.
I didn't realize that, as systemd is forking it, then the program have a parent process (thus getppid() returning 1 when running with systemd but not from the command line)
Anyway it is badly written: this test should stop the script.

OpenMPI bind() failed on error Address already in use (48) Mac OS X

I have installed OpenMPI and tried to compile/execute one of the examples delivered with the newest version.
As I try to run with mpiexec it says that the address is already in use.
Someone got a hint why this is always happening?
Kristians-MacBook-Pro:examples kristian$ mpicc -o hello hello_c.c
Kristians-MacBook-Pro:examples kristian$ mpiexec -n 4 ./hello
[Kristians-MacBook-Pro.local:02747] [[56076,0],0] bind() failed on error Address already in use (48)
[Kristians-MacBook-Pro.local:02747] [[56076,0],0] ORTE_ERROR_LOG: Error in file oob_usock_component.c at line 228
[Kristians-MacBook-Pro.local:02748] [[56076,1],0] usock_peer_send_blocking: send() to socket 19 failed: Socket is not connected (57)
[Kristians-MacBook-Pro.local:02748] [[56076,1],0] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 315
[Kristians-MacBook-Pro.local:02748] [[56076,1],0] orte_usock_peer_try_connect: usock_peer_send_connect_ack to proc [[56076,0],0] failed: Unreachable (-12)
[Kristians-MacBook-Pro.local:02749] [[56076,1],1] usock_peer_send_blocking: send() to socket 20 failed: Socket is not connected (57)
[Kristians-MacBook-Pro.local:02749] [[56076,1],1] ORTE_ERROR_LOG: Unreachable in file oob_usock_connection.c at line 315
[Kristians-MacBook-Pro.local:02749] [[56076,1],1] orte_usock_peer_try_connect: usock_peer_send_connect_ack to proc [[56076,0],0] failed: Unreachable (-12)
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[56076,1],0]
Exit code: 1
--------------------------------------------------------------------------
Thanks in advance.
Okay.
I have now changed the $TMPDIR environment variable with export TMPDIR=/tmp and it works.
Now it seems to me that the OpenMPI Session folder was blocking my communication. But why did it?
Am I missing something here?

MPI number of process

I am running a sample MPI program which prints hello world.
When I am running with 1,2....330 process it runs as expected.
But when the number goes beyond 330 it fails with below error.
Can some explain the reason for this.
I am running the program on my laptop which has i5 processor with 4 cores and 8 GB RAM.
[proxy:0:0#Abhishek-Machine] HYDU_create_process (./utils/launch/launch.c:25): pipe error (Too many open files)
[proxy:0:0#Abhishek-Machine] launch_procs (./pm/pmiserv/pmip_cb.c:705): create process returned error
[proxy:0:0#Abhishek-Machine] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:893): launch_procs returned error
[proxy:0:0#Abhishek-Machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0#Abhishek-Machine] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec#Abhishek-Machine] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec#Abhishek-Machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec#Abhishek-Machine] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec#Abhishek-Machine] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
You are hitting an OS limit for socket descriptors or similar. Over subscribing your workstation to this degree is not a good idea and unlikely to work unless you change your system settings (which is not a good idea for this use case).

Resources