MPI number of processes - c

I am running a sample MPI program that prints "hello world".
When I run it with anywhere from 1 to 330 processes, it runs as expected.
But when the number of processes goes beyond 330, it fails with the error below.
Can someone explain the reason for this?
I am running the program on my laptop, which has an i5 processor with 4 cores and 8 GB of RAM.
[proxy:0:0@Abhishek-Machine] HYDU_create_process (./utils/launch/launch.c:25): pipe error (Too many open files)
[proxy:0:0@Abhishek-Machine] launch_procs (./pm/pmiserv/pmip_cb.c:705): create process returned error
[proxy:0:0@Abhishek-Machine] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:893): launch_procs returned error
[proxy:0:0@Abhishek-Machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@Abhishek-Machine] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@Abhishek-Machine] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@Abhishek-Machine] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@Abhishek-Machine] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@Abhishek-Machine] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

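You are hitting the OS limit on open file descriptors: the "Too many open files" error is raised when the launcher tries to create another pipe for a new process. Oversubscribing a 4-core workstation to this degree is not a good idea, and it is unlikely to work unless you raise the limit in your system settings (which is not advisable for this use case).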
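To check the limit involved, here is a minimal sketch using Python's standard library (Unix only). It reads the per-process limit on open file descriptors and makes a rough estimate of how many children one launcher could serve. The function name estimate_max_children, the figure of three descriptors per child (one pipe each for stdin, stdout and stderr) and the reserve of 16 descriptors are assumptions made for illustration; they are not taken from the MPICH/Hydra source.

```python
import resource


def estimate_max_children(soft_limit, fds_per_child=3, reserved=16):
    """Rough upper bound on children one launcher process can serve.

    fds_per_child and reserved are illustrative assumptions, not values
    taken from the MPICH/Hydra source.
    """
    return max(0, (soft_limit - reserved) // fds_per_child)


# Per-process limit on open file descriptors (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

if soft == resource.RLIM_INFINITY:
    print("no soft limit is set")
else:
    print(f"estimated max children: {estimate_max_children(soft)}")
```

With a soft limit of 1024, a common default on Linux, this estimate comes out at 336, which is in the same range as the threshold reported in the question. That may be a coincidence of the assumed numbers rather than confirmation of how the launcher uses its descriptors.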

Related

Error on iOS14 when loading OBJs into MDLAsset

When loading OBJs into an MDLAsset using the MDLAsset(url:) initializer (to eventually get the model into SceneKit), the operation fails frequently and inconsistently on iOS14. This operation works fine for these same files on previous iOS versions. I've also observed the bug on iPadOS, although maybe less frequently. Not sure if it's relevant, but these OBJs are pulled from server and stored locally. But this bug is occurring after files are already downloaded. Sometimes the same file will fail multiple times before randomly working, and vice versa.
The console output seems to indicate a failure to communicate with ModelIO XPC service. I tried restarting my device, but the bug continues to occur. Console output:
connection to com.apple.ModelIO.AssetLoader was interrupted
AssetLoader.loadURL errorHandler: Error Domain=NSCocoaErrorDomain Code=4097 "connection to service on pid 0 named com.apple.ModelIO.AssetLoader" UserInfo={NSDebugDescription=connection to service on pid 0 named com.apple.ModelIO.AssetLoader}
Couldn’t communicate with a helper application.
connection to com.apple.ModelIO.AssetLoader was interrupted
Has anyone else run into this issue on iOS14?
Alternatively, are there any workarounds anyone has tried in the meantime? As far as I know, loading an OBJ (that is downloaded from server) into SceneKit can only be done through ModelIO, without writing an OBJ parser myself.
This seems to be fixed in 14.3.
2020-10-13 18:31:36.989282+0300 Studia3D Viewer[1452:348335] connection to com.apple.ModelIO.AssetLoader was interrupted
2020-10-13 18:31:36.989368+0300 Studia3D Viewer[1452:347676] AssetLoader.loadURL errorHandler: Error Domain=NSCocoaErrorDomain Code=4097 “connection to service on pid 0 named com.apple.ModelIO.AssetLoader” UserInfo={NSDebugDescription=connection to service on pid 0 named com.apple.ModelIO.AssetLoader}
2020-10-13 18:31:36.989404+0300 Studia3D Viewer[1452:348332] connection to com.apple.ModelIO.AssetLoader was interrupted
2020-10-13 18:31:36.997352+0300 Studia3D Viewer[1452:347676] Couldn't communicate with a helper application.
The same thing happens with local files.
No solution yet.

Forked child keeps being terminated with status 0x008B

I'm running Ubuntu 18.10 in VirtualBox, and I'm new to it. My code forks 100 child processes that work on a shared memory segment. SOMETIMES I get this message:
Sender(Pid = (childPID)) terminated with status 0x008B.
Searching the web, I found that this could be a SIGSEGV error. Is that true?
Finally, is there any way to find WHERE the code fails in over 1000 lines? I tried using this guide: http://www.unknownroad.com/rtfm/gdbtut/gdbsegfault.html to find the error with gdb, but my terminal tells me "No stack". I'm totally new to this kind of problem, so any hint will be appreciated.
Sender(Pid = (childPID)) terminated with status 0x008B.
Searching in the web I found that could be a SIGSEGV error. Is it true?
Yes, that indicates termination by signal 11 (the low seven bits of the status, 0x0B), which is SIGSEGV on Linux.
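As an illustration of how the status 0x008B breaks down, here is a minimal sketch in Python. The helper name decode_wait_status is hypothetical, and the bit layout (bits 0-6 for the signal number, bit 7 as the core-dump flag) is the conventional Linux encoding of a wait status for a signal-terminated process, assumed here rather than taken from the question's code.

```python
import signal


def decode_wait_status(status):
    """Split a raw wait() status of a signal-terminated process.

    Assumes the conventional Linux layout: bits 0-6 hold the signal
    number and bit 7 (0x80) is the core-dump flag.
    """
    signal_number = status & 0x7F
    core_dumped = bool(status & 0x80)
    return signal_number, core_dumped


signum, dumped = decode_wait_status(0x008B)
print(signum, dumped)               # 11 True
print(signal.Signals(signum).name)  # SIGSEGV on Linux
```

Because the core-dump flag is set, a core file may be available, depending on how the system is configured to handle core dumps. Loading it with gdb alongside the program and asking for a backtrace is one way to get the stack that the tutorial's steps did not show.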
Finally, is there any way to find WHERE the code fails in over 1000 lines?
I'd run the program with valgrind.

How to handle exit codes with Camel Exec?

I'm using Camel Exec for automated shutdowns on some of our devices.
The shutdown command is pretty simple, and it mostly works fine:
from(START_DEEP_SLEEP)
    .setBody(constant(null)) // we don't want stdin for exec
    .setHeader(ExecBinding.EXEC_COMMAND_ARGS, constant("""shutdown $shutdownDelay "starting deep sleep shutdown" """))
    .to("exec:sudo")
Obviously, this command also shuts down the application executing it. That isn't much of an issue either, except that sometimes it produces an exit value of 143. I know the meaning of the return value, and it makes sense to see it here, but this only happens on some devices. Most others just return 0. They are all of the same type, so I really don't know where this discrepancy comes from, but it's not even that big an issue. The shutdown works nonetheless.
The problem is that camel exec logs this as an error:
ERROR 549 --- [Camel (camel-1) thread #1 - seda://start-deepsleep] o.a.camel.component.exec.ExecProducer : The command ExecCommand [args=[shutdown, now, starting deep sleep shutdown], executable=sudo, timeout=9223372036854775807, outFile=null, workingDir=null, useStderrOnEmptyStdout=false] returned exit value 143
This produces undesired noise in our monitoring, and I would rather not have it logged.
The core issue here is that Camel Exec does not throw, so there's no exception I could handle. It just logs the error, which then gets picked up by our log analysis.
I would like to handle that exit code gracefully without Camel Exec logging an error. The return value is already logged separately anyway. How can I do that?
According to the documentation (http://camel.apache.org/exec.html), the header ExecBinding.EXEC_EXIT_VALUE holds the exit value of the executed command. In your case it should be 143 (the documentation notes that the value depends on the OS).
You could use that header as a hook for dealing with the log entry, e.g. by deleting the most recent entry that carries the same exit value.
Of course this is only a cosmetic fix. The implementation could look like this:
from(START_DEEP_SLEEP)
    .setBody(constant(null)) // we don't want stdin for exec
    .setHeader(ExecBinding.EXEC_COMMAND_ARGS, constant("""shutdown $shutdownDelay "starting deep sleep shutdown" """))
    .to("exec:sudo")
    .choice() // when() is only valid inside a choice() block
        .`when`(header(ExecBinding.EXEC_EXIT_VALUE).isEqualTo(143)) // backticks because `when` is a Kotlin keyword
            .to("direct:edit_the_log")
    .end()
Please note that I have not tested this code. Depending on your imports, you may need to reference the header as
header(EXEC_EXIT_VALUE)
instead.
Please let me know whether this is a workable solution.
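For context on the value itself: 143 fits the common Unix convention in which an exit value above 128 means the process was terminated by signal (value - 128), here 15, SIGTERM. That is consistent with the command being killed while the system goes down. The sketch below shows the arithmetic in Python; the helper name describe_exit_value is hypothetical, and the 128 + N rule is a convention rather than something Camel guarantees.

```python
import signal


def describe_exit_value(exit_value):
    """Interpret a shell-style exit value.

    Values above 128 conventionally mean "terminated by signal
    (exit_value - 128)"; this is a convention, not a guarantee.
    """
    if exit_value > 128:
        signum = exit_value - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by signal {signum}"
    return f"exited with status {exit_value}"


print(describe_exit_value(143))  # killed by SIGTERM
print(describe_exit_value(0))    # exited with status 0
```

A route that treats both 0 and 143 as success would therefore be accepting either a clean exit or termination by SIGTERM during shutdown.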

PostgreSQL crashed with error: 'server process (PID XXXX) was terminated by exception 0xC0000142'

I have PostgreSQL 9.2 running on a machine with 4 GB of memory and an Atom N2800 CPU, under Windows POSReady (an embedded system similar to XP). It has run fine for years in a production environment, but it crashed (the service stopped) frequently during recent performance (not stress) testing.
I don't think the testing put too much stress on it. With log_min_duration_statement = 0 enabled, the simplified overall statistics for what the test did are listed below.
Taking 20 minutes as the unit of measure, one unit contained:
5000 UPDATEs, each query containing 20 KB of data (including a Text field).
35000 SELECTs, each query returning 20 KB of data (to fetch that Text field).
The logs showed nothing abnormal until the crash, which left this:
2015-07-29 16:41:53.500 SGT,,,5512,,55b87f74.1588,2,,2015-07-29 15:23:32 SGT,,0,LOG,00000,"server process (PID 4416) was terminated by exception 0xC0000142",,"See C include file ""ntstatus.h"" for a description of the hexadecimal value.",,,,,,,""
2015-07-29 16:41:53.500 SGT,,,5512,,55b87f74.1588,3,,2015-07-29 15:23:32 SGT,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2015-07-29 16:41:53.500 SGT,"eps","transactiondatabase",6960,"127.0.0.1:9162",55b891cf.1b30,9,"idle",2015-07-29 16:41:51 SGT,146/0,0,WARNING,57P00,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2015-07-29 16:41:53.515 SGT,"eps","transactiondatabase",5828,"127.0.0.1:9150",55b891c2.16c4,155,"idle",2015-07-29 16:41:38 SGT,145/0,0,WARNING,57P00,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2015-07-29 16:41:53.515 SGT,"eps","transactiondatabase",6448,"127.0.0.1:9148",55b891c2.1930,5,"idle",2015-07-29 16:41:38 SGT,93/0,0,WARNING,57P00,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
....
....
2015-07-29 16:41:54.500 SGT,,,8004,,55b87f76.1f44,2,,2015-07-29 15:23:34 SGT,1/0,0,WARNING,57P00,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,""
2015-07-29 16:41:54.515 SGT,,,5512,,55b87f74.1588,4,,2015-07-29 15:23:32 SGT,,0,LOG,00000,"all server processes terminated; reinitializing",,,,,,,,,""
2015-07-29 16:42:04.515 SGT,,,5512,,55b87f74.1588,5,,2015-07-29 15:23:32 SGT,,0,FATAL,XX000,"pre-existing shared memory block is still in use",,"Check if there are any old server processes still running, and terminate them.",,,,,,,""
2015-07-29 16:51:02.078 SGT,,,5828,,55b893f6.16c4,1,,2015-07-29 16:51:02 SGT,,0,LOG,00000,"database system was interrupted; last known up at 2015-07-29 16:40:36 SGT",,,,,,,,,""
2015-07-29 16:51:02.093 SGT,,,5828,,55b893f6.16c4,2,,2015-07-29 16:51:02 SGT,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2015-07-29 16:51:02.109 SGT,,,5828,,55b893f6.16c4,3,,2015-07-29 16:51:02 SGT,,0,LOG,00000,"redo starts at 0/12C79578",,,,,,,,,""
2015-07-29 16:51:02.421 SGT,,,5828,,55b893f6.16c4,4,,2015-07-29 16:51:02 SGT,,0,LOG,00000,"unexpected pageaddr 0/1046A000 in log file 0, segment 19, offset 4628480",,,,,,,,,""
2015-07-29 16:51:02.421 SGT,,,5828,,55b893f6.16c4,5,,2015-07-29 16:51:02 SGT,,0,LOG,00000,"redo done at 0/13469FC8",,,,,,,,,""
One thing I can point to is the shared_buffers setting, which is currently 256MB. That value was not chosen for any particular reason. Would it help to increase it?
Other major settings: max_connections = 200, temp_buffers = 16MB, work_mem = 8MB
Could anyone help me work out how the crash happened, or how to narrow down the cause?
MSDN says:
0xC0000142
STATUS_DLL_INIT_FAILED
{DLL Initialization Failed} Initialization of the dynamic link library %hs failed. The process is terminating abnormally.
So it was a DLL loading issue and/or an issue starting a new process. If I had to guess, I'd say you might have hit limits on the number of open files, the number of running processes, etc. on your XP Embedded system. You might want to lower max_connections.
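To show how the value in the log maps onto the NTSTATUS format, here is a minimal sketch in Python that splits a 32-bit status into its bit fields (2 severity bits, a customer-defined bit, a 12-bit facility and a 16-bit code). The helper name decode_ntstatus is hypothetical, and the sketch only decodes the fields; it does not look up the symbolic name.

```python
def decode_ntstatus(value):
    """Split a 32-bit NTSTATUS value into its bit fields."""
    severities = ["success", "informational", "warning", "error"]
    return {
        "severity": severities[(value >> 30) & 0x3],    # bits 31-30
        "customer_defined": bool((value >> 29) & 0x1),  # bit 29
        "facility": (value >> 16) & 0xFFF,              # bits 27-16
        "code": value & 0xFFFF,                         # bits 15-0
    }


fields = decode_ntstatus(0xC0000142)
print(fields["severity"], fields["facility"], hex(fields["code"]))
```

For 0xC0000142 this gives error severity, facility 0 and code 0x142, i.e. a system-defined error rather than an application-defined one, which matches the STATUS_DLL_INIT_FAILED entry quoted above.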

Error while trying to run MPI program with username

When I run the program via:
myshell$] mpirun --hosts localhost,192.168.1.4 ./a.out
the program executes successfully. Now when I try to run:
myshell$] mpirun --hosts localhost,myac@192.168.1.4 ./a.out
OpenSSH prompts for a password, and then I get:
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(433)..............:
MPID_Init(176).....................: channel initialization failed
MPIDI_CH3_Init(70).................:
MPID_nem_init(286).................:
MPID_nem_tcp_init(108).............:
MPID_nem_tcp_get_business_card(354):
MPID_nem_tcp_init(313).............: gethostbyname failed, myac@192.168.1.4 (errno 1)
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@myac] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@myac] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@myac] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@myac] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@myac] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@myac] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@myac] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
Why am I getting an error when I provide the username?
You could try specifying a username in your ssh config file (http://www.cyberciti.biz/faq/create-ssh-config-file-on-linux-unix/) instead of on the mpirun command line. That way perhaps mpirun would not be confused by the extra username part, which as far as I can see from the documentation it does not support. But ssh could, behind the scenes, use the username you specify in your ssh config file. And of course you'll want to set up SSH keys so you don't have to type a password.
I don't believe MPICH supports providing usernames in the --hosts value on the command line. You should try the host file based method described on the wiki. http://wiki.mpich.org/mpich/index.php/Using_the_Hydra_Process_Manager#Using_Hydra_on_Machines_with_Different_User_Names
For example:
shell$ cat hosts
donner user=foo
foo user=bar
shakey user=bar
terra user=foo
