Riak crashes 30 seconds after start - Solr
Riak crashes about 30 seconds after I run riak start. I have the following (changed) settings in my riak.conf:
search = on
storage_backend = leveldb
riak_control = on
crash.log contains the following:
2016-06-30 14:49:38 =ERROR REPORT====
** Generic server yz_solr_proc terminating
** Last message in was {check_solr,0}
** When Server state == {state,"./data/yz",#Port<0.9441>,8093,8985}
** Reason for termination ==
** "solr didn't start in alloted time"
2016-06-30 14:49:38 =CRASH REPORT====
crasher:
initial call: yz_solr_proc:init/1
pid: <0.582.0>
registered_name: yz_solr_proc
exception exit: {"solr didn't start in alloted time",[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,744}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
ancestors: [yz_solr_sup,yz_sup,<0.578.0>]
messages: [{'EXIT',#Port<0.9441>,normal}]
links: [<0.580.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 376
stack_size: 27
reductions: 16170
neighbours:
2016-06-30 14:49:38 =SUPERVISOR REPORT====
Supervisor: {local,yz_solr_sup}
Context: child_terminated
Reason: "solr didn't start in alloted time"
Offender: [{pid,<0.582.0>},{name,yz_solr_proc},{mfargs,{yz_solr_proc,start_link,["./data/yz","./data/yz_temp",8093,8985]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
2016-06-30 14:49:39 =ERROR REPORT====
** Generic server yz_solr_proc terminating
** Last message in was {#Port<0.12204>,{exit_status,1}}
** When Server state == {state,"./data/yz",#Port<0.12204>,8093,8985}
** Reason for termination ==
** {"solr OS process exited",1}
2016-06-30 14:49:39 =CRASH REPORT====
crasher:
initial call: yz_solr_proc:init/1
pid: <0.7631.0>
registered_name: yz_solr_proc
exception exit: {{"solr OS process exited",1},[{gen_server,terminate,6,[{file,"gen_server.erl"},{line,744}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
ancestors: [yz_solr_sup,yz_sup,<0.578.0>]
messages: [{'EXIT',#Port<0.12204>,normal}]
links: [<0.580.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 1598
stack_size: 27
reductions: 8968
neighbours:
2016-06-30 14:49:39 =SUPERVISOR REPORT====
Supervisor: {local,yz_solr_sup}
Context: child_terminated
Reason: {"solr OS process exited",1}
Offender: [{pid,<0.7631.0>},{name,yz_solr_proc},{mfargs,{yz_solr_proc,start_link,["./data/yz","./data/yz_temp",8093,8985]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
2016-06-30 14:49:39 =SUPERVISOR REPORT====
Supervisor: {local,yz_solr_sup}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.7631.0>},{name,yz_solr_proc},{mfargs,{yz_solr_proc,start_link,["./data/yz","./data/yz_temp",8093,8985]}},{restart_type,permanent},{shutdown,5000},{child_type,worker}]
2016-06-30 14:49:39 =SUPERVISOR REPORT====
Supervisor: {local,yz_sup}
Context: child_terminated
Reason: shutdown
Offender: [{pid,<0.580.0>},{name,yz_solr_sup},{mfargs,{yz_solr_sup,start_link,[]}},{restart_type,permanent},{shutdown,5000},{child_type,supervisor}]
2016-06-30 14:49:39 =SUPERVISOR REPORT====
Supervisor: {local,yz_sup}
Context: shutdown
Reason: reached_max_restart_intensity
Offender: [{pid,<0.580.0>},{name,yz_solr_sup},{mfargs,{yz_solr_sup,start_link,[]}},{restart_type,permanent},{shutdown,5000},{child_type,supervisor}]
Make sure the ports used by Solr are available. The defaults are 8093 for search and 8985 for JMX.
Tune your system to improve performance. Follow the Improving Performance for Linux guide.
In riak.conf, increase the JVM's heap size; the default of 1 GB is often not enough. For example, search.solr.jvm_options=-d64 -Xms2g -Xmx4g -XX:+UseStringCache -XX:+UseCompressedOops (see Search Settings).
On a slow machine, Solr may simply take longer to start. Try increasing search.solr.start_timeout (see the sketch after this list of suggestions).
Solr's directories must be writable (usually /var/lib/riak/data/yz*), and a compatible JVM must be used.
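As a concrete starting point, here is a minimal sketch assuming Riak 2.x key names; the values are illustrative only and should be tuned to your hardware:
# in riak.conf: give Solr more time and more heap
search.solr.start_timeout = 60s
search.solr.jvm_options = -d64 -Xms2g -Xmx4g -XX:+UseStringCache -XX:+UseCompressedOops
# from a shell: confirm nothing else is already bound to Solr's ports
$ netstat -tlnp | grep -E ':(8093|8985)'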
Riak's internal Solr uses localhost and 127.0.0.1 as its default host, so the following entry must be present in /etc/hosts:
127.0.0.1 localhost
FYI, if you are on Windows, the hosts file may be in a different location.
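If Solr still refuses to come up, a few quick checks help narrow it down (a sketch; the log path assumes a default Linux package install and may differ on your system):
$ grep localhost /etc/hosts          # 127.0.0.1 localhost must be present
$ java -version                      # confirm a JVM supported by your Riak release is on the PATH
$ tail -n 50 /var/log/riak/solr.log  # Solr's own startup errors usually end up here, not in crash.log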
Related
Azure SQL Edge container failed to start on M1 when mapping a volume to a relative path
On an M1 Macbook, I followed online examples and successfully start Azure SQL Edge container with basic configuration. Then I want to map a volume (mySpecialFolder) by "Path to the host, relative to the Compose file". Here we want "./mySpecialFolder:/tmp", not "mySpecialFolder:/tmp". services: mssql: container_name: mssql image: "mcr.microsoft.com/azure-sql-edge:latest" environment: SA_PASSWORD: "something" ACCEPT_EULA: "Y" expose: - 1433 ports: - 1433:1433 networks: - sql volumes: - ./mySpecialFolder:/tmp - mssqlsystem:/var/opt/mssql It failed to load and reports Azure SQL Edge will run as non-root by default. This container is running as user mssql. To learn more visit https://go.microsoft.com/fwlink/?linkid=2140520. 2022/07/29 11:00:39 [launchpadd] INFO: Extensibility Log Header: <timestamp> <process> <sandboxId> <sessionId> <message> 2022/07/29 11:00:39 [launchpadd] WARNING: Failed to load /var/opt/mssql/mssql.conf ini file with error open /var/opt/mssql/mssql.conf: no such file or directory 2022/07/29 11:00:39 [launchpadd] INFO: DataDirectories = /bin:/etc:/lib:/lib32:/lib64:/sbin:/usr/bin:/usr/include:/usr/lib:/usr/lib32:/usr/lib64:/usr/libexec/gcc:/usr/sbin:/usr/share:/var/lib:/opt/microsoft:/opt/mssql-extensibility:/opt/mssql/mlservices:/opt/mssql/lib/zulu-jre-11:/opt/mssql-tools 2022/07/29 11:00:39 Drop permitted effective capabilities. 2022/07/29 11:00:39 [launchpadd] INFO: Polybase remote hadoop bridge disabled 2022/07/29 11:00:39 [launchpadd] INFO: Launchpadd is connecting to mssql on localhost:1431 2022/07/29 11:00:39 [launchpadd] WARNING: Failed to connect to SQL because: dial tcp 127.0.0.1:1431: connect: connection refused, will reattempt connection. This program has encountered a fatal error and cannot continue running at Fri Jul 29 11:00:40 2022 The following diagnostic information is available: Reason: 0x00000007 Status: 0xc0000002 Message: Failed to load KM driver [Npfs] Stack Trace: file://package4/windows/system32/sqlpal.dll+0x000000000030E879 file://package4/windows/system32/sqlpal.dll+0x000000000030DB54 file://package4/windows/system32/sqlpal.dll+0x000000000030AB96 file://package4/windows/system32/sqlpal.dll+0x000000000030961D file://package4/windows/system32/sqlpal.dll+0x000000000034EE01 Stack: IP Function ---------------- -------------------------------------- 0000aaaac9c2ba70 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::~_Sp_counted_base()+0x25d0 0000aaaac9c2b618 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::~_Sp_counted_base()+0x2178 0000aaaac9c39d74 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::~_Sp_counted_base()+0x108d4 0000aaaac9c3a75c std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::~_Sp_counted_base()+0x112bc 0000aaaac9ced6c4 std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > std::operator+<char, std::char_traits<char>, std::allocator<char> >(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_ 0000ffffb9e44df8 S_SbtUnimplementedInstruction+0x2542b4 0000ffffb9e4472c S_SbtUnimplementedInstruction+0x253be8 0000ffffb9e45238 S_SbtUnimplementedInstruction+0x2546f4 0000ffffb9e3ca90 S_SbtUnimplementedInstruction+0x24bf4c 0000ffffb9e395dc S_SbtUnimplementedInstruction+0x248a98 0000ffffb9ed8ddc S_SbtUnimplementedInstruction+0x2e8298 0000ffffb9e38e44 S_SbtUnimplementedInstruction+0x248300 0000ffffb9e38b98 S_SbtUnimplementedInstruction+0x248054 0000ffffb9e38604 S_SbtUnimplementedInstruction+0x247ac0 0000ffffb9e38ffc S_SbtUnimplementedInstruction+0x2484b8 
0000ffffbdb248a4 CallGuestFunction+0x84 0000ffffbdb1f964 Sbt::Dispatcher::SimulateCpu(Sbt::GuestCtx*)+0x2c 0000ffffbdb20d9c Sbt::RuntimeImpl::SimulateCpu(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x3c8 0000ffffbdb219e4 Sbt::SimulateCpu(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long)+0x30 0000ffffbdb22c04 SbtRtSimulateCpu+0x84 0000aaaac9c42164 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::~_Sp_counted_base()+0x18cc4 0000aaaac9c3fe34 std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::~_Sp_counted_base()+0x16994 Process: 24 - sqlservr Thread: 28 (application thread 0x4) Instance Id: 76bd6c34-28e2-4a7f-9e5a-f3ffa17d9c1a Crash Id: a022551e-96fe-4a59-ada3-4da01d244653 Build stamp: 06cd67626d2ebedd8721dc1bd892cdda65157cdcd6ac004bb81acdd6498ec618 Distribution: Ubuntu 18.04.6 LTS aarch64 Processors: 5 Total Memory: 8232747008 bytes Timestamp: Fri Jul 29 11:00:40 2022 Last errno: 2 Last errno text: No such file or directory
PiHole with Recursive DNS not Handshaking with Wireguard setup via PiVPN
Expected Behaviour: I've set my Router to use my PiHole along with Wireguard to use it as a VPN. I've set it up using PIVPN and some tutorials on Youtube. I have included Screenshots of my router and it's setup along with my Wireguard config files and setup. My PiHole is set up to use Recursive DNS and I have set up a DDNS with my Router and made sure to disable my Router's inherent DHCP service, set the PIHole as my Primary DNS and reserve the address. My PiHole is working nicely, but none of my devices are connecting to the Wireguard VPN. Actual Behaviour: My Phone/Mac should be handshaking with the VPN but it's not. My Pi-Hole is working correctly but the Wireguard is not. I have been working on this for the better part of the day and am utterly at a loss, any help whatsoever would be greatly appreciated. thanks! The three main Youtube videos I used to help me set this up were: For the Wireguard and Pi-Hole interaction https://www.youtube.com/watch?v=DUpIOSbbvKk&t=595s https://www.youtube.com/watch?v=lnYYmC-A4S0 For my Recursive Pi-Hole DNS server https://www.youtube.com/watch?v=FnFtWsZ8IP0&t=939s I did end up using PIVPN to set things up PI Ifconfig Results pi#raspberrypi:~ $ ifconfig eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500 inet 192.168.0.155 netmask 255.255.255.0 broadcast 192.168.0.255 inet6 fe80::xxx:xxxx:xxxx:xxxx prefixlen 64 scopeid 0x20<link> ether xx:xx:xx:xx:xx:xx txqueuelen 1000 (Ethernet) RX packets 8349 bytes 1644604 (1.5 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 3440 bytes 943688 (921.5 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 inet6 ::1 prefixlen 128 scopeid 0x10<host> loop txqueuelen 1000 (Local Loopback) RX packets 1388 bytes 123499 (120.6 KiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 1388 bytes 123499 (120.6 KiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 wg0: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1420 inet 10.6.0.1 netmask 255.255.255.0 destination 10.6.0.1 unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 58 overruns 0 carrier 0 collisions 0 wlan0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500 ether xx:xx:xx:xx:xx:xx txqueuelen 1000 (Ethernet) RX packets 0 bytes 0 (0.0 B) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 0 bytes 0 (0.0 B) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0 Debug Token: *** [ INITIALIZING ] [i] 2022-01-30:11:31:18 debug log has been initialized. [i] System has been running for 0 days, 0 hours, 46 minutes *** [ INITIALIZING ] Sourcing setup variables [i] Sourcing /etc/pihole/setupVars.conf... 
*** [ DIAGNOSING ]: Core version [i] Core: v5.8.1 (https://discourse.pi-hole.net/t/how-do-i-update-pi-hole/249) [i] Remotes: origin https://github.com/pi-hole/pi-hole.git (fetch) origin https://github.com/pi-hole/pi-hole.git (push) [i] Branch: master [i] Commit: v5.8.1-0-g875ad04 *** [ DIAGNOSING ]: Web version [i] Web: v5.10.1 (https://discourse.pi-hole.net/t/how-do-i-update-pi-hole/249) [i] Remotes: origin https://github.com/pi-hole/AdminLTE.git (fetch) origin https://github.com/pi-hole/AdminLTE.git (push) [i] Branch: master [i] Commit: v5.10.1-0-gcb7a866 *** [ DIAGNOSING ]: FTL version [✓] FTL: v5.13 *** [ DIAGNOSING ]: lighttpd version [i] 1.4.59 *** [ DIAGNOSING ]: php version [i] 7.4.25 *** [ DIAGNOSING ]: Operating system [i] dig return code: 0 [i] dig response: "Raspbian=9,10,11 Ubuntu=16,18,20,21 Debian=9,10,11 Fedora=33,34 CentOS=7,8" [✓] Distro: Raspbian [✓] Version: 11 *** [ DIAGNOSING ]: SELinux [i] SELinux not detected *** [ DIAGNOSING ]: FirewallD [i] Firewalld service inactive *** [ DIAGNOSING ]: Processor [✓] armv7l *** [ DIAGNOSING ]: Disk usage Filesystem Size Used Avail Use% Mounted on /dev/root 29G 1.6G 26G 6% / devtmpfs 333M 0 333M 0% /dev tmpfs 462M 1.1M 461M 1% /dev/shm tmpfs 185M 716K 184M 1% /run tmpfs 5.0M 4.0K 5.0M 1% /run/lock /dev/mmcblk0p1 253M 50M 203M 20% /boot tmpfs 93M 0 93M 0% /run/user/999 tmpfs 93M 0 93M 0% /run/user/1000 *** [ DIAGNOSING ]: Networking [✓] IPv4 address(es) bound to the eth0 interface: 192.168.0.155/24 [✓] IPv6 address(es) bound to the eth0 interface: fe80::a7c:c1a2:460f:f20b/64 [i] Default IPv4 gateway: 192.168.0.1 * Pinging 192.168.0.1... [✓] Gateway responded. *** [ DIAGNOSING ]: Ports in use [✓] udp:0.0.0.0:53 is in use by pihole-FTL udp:0.0.0.0:68 is in use by dhcpcd udp:0.0.0.0:51820 is in use by <unknown> udp:127.0.0.1:5335 is in use by unbound udp:0.0.0.0:5353 is in use by avahi-daemon udp:0.0.0.0:51038 is in use by avahi-daemon [✓] udp:*:53 is in use by pihole-FTL udp:*:51820 is in use by <unknown> udp:*:5353 is in use by avahi-daemon udp:*:37789 is in use by avahi-daemon [✓] tcp:127.0.0.1:4711 is in use by pihole-FTL [✓] tcp:0.0.0.0:80 is in use by lighttpd [✓] tcp:0.0.0.0:53 is in use by pihole-FTL tcp:0.0.0.0:22 is in use by sshd tcp:127.0.0.1:5335 is in use by unbound tcp:127.0.0.1:8953 is in use by unbound [✓] tcp:[::1]:4711 is in use by pihole-FTL [✓] tcp:[::]:80 is in use by lighttpd [✓] tcp:[::]:53 is in use by pihole-FTL tcp:[::]:22 is in use by sshd *** [ DIAGNOSING ]: Name resolution (IPv4) using a random blocked domain and a known ad-serving domain [✓] mail.chileexe77.com is 0.0.0.0 on lo (127.0.0.1) [✓] mail.chileexe77.com is 0.0.0.0 on eth0 (192.168.0.155) [✓] No IPv4 address available on wlan0 [✓] mail.chileexe77.com is 0.0.0.0 on wg0 (10.6.0.1) [✓] doubleclick.com is 172.217.15.238 via a remote, public DNS server (8.8.8.8) *** [ DIAGNOSING ]: Name resolution (IPv6) using a random blocked domain and a known ad-serving domain [✓] file.firefoxupdata.com is :: on lo (::1) [✓] file.firefoxupdata.com is :: on eth0 (fe80::a7c:c1a2:460f:f20b) [✓] No IPv6 address available on wlan0 [✓] No IPv6 address available on wg0 [✗] Failed to resolve doubleclick.com via a remote, public DNS server (2001:4860:4860::8888) *** [ DIAGNOSING ]: Discovering active DHCP servers (takes 10 seconds) Scanning all your interfaces for DHCP servers Timeout: 10 seconds WARN: Could not sendto() in send_dhcp_discover() (/__w/FTL/FTL/src/dhcp-discover.c:233): Operation not permitted DHCP packets received on interface wlan0: 0 DHCP packets received 
on interface eth0: 0 DHCP packets received on interface lo: 0 *** [ DIAGNOSING ]: Pi-hole processes [✓] lighttpd daemon is active [✓] pihole-FTL daemon is active *** [ DIAGNOSING ]: Pi-hole-FTL full status ● pihole-FTL.service - LSB: pihole-FTL daemon Loaded: loaded (/etc/init.d/pihole-FTL; generated) Active: active (exited) since Sun 2022-01-30 10:44:27 MST; 47min ago Docs: man:systemd-sysv-generator(8) Process: 637 ExecStart=/etc/init.d/pihole-FTL start (code=exited, status=0/SUCCESS) CPU: 143ms Jan 30 10:44:24 raspberrypi systemd[1]: Starting LSB: pihole-FTL daemon... Jan 30 10:44:25 raspberrypi pihole-FTL[637]: Not running Jan 30 10:44:25 raspberrypi su[665]: (to pihole) root on none Jan 30 10:44:25 raspberrypi su[665]: pam_unix(su:session): session opened for user pihole(uid=999) by (uid=0) Jan 30 10:44:27 raspberrypi pihole-FTL[738]: FTL started! Jan 30 10:44:27 raspberrypi systemd[1]: Started LSB: pihole-FTL daemon. *** [ DIAGNOSING ]: Setup variables PIHOLE_INTERFACE=eth0 IPV4_ADDRESS=192.168.0.155/24 IPV6_ADDRESS= QUERY_LOGGING=true INSTALL_WEB_SERVER=true INSTALL_WEB_INTERFACE=true LIGHTTPD_ENABLED=true CACHE_SIZE=10000 BLOCKING_ENABLED=true PIHOLE_DNS_1=127.0.0.1#5335 DNS_FQDN_REQUIRED=true DNS_BOGUS_PRIV=true DNSSEC=false REV_SERVER=false DNSMASQ_LISTENING=local *** [ DIAGNOSING ]: Dashboard and block page [✗] Block page X-Header: X-Header does not match or could not be retrieved. HTTP/1.1 200 OK Content-type: text/html; charset=UTF-8 Expires: Sun, 30 Jan 2022 18:31:35 GMT Cache-Control: max-age=0 Date: Sun, 30 Jan 2022 18:31:35 GMT Server: lighttpd/1.4.59 [✓] Web interface X-Header: X-Pi-hole: The Pi-hole Web interface is working! *** [ DIAGNOSING ]: Gravity Database -rw-rw-r-- 1 pihole pihole 220K Jan 30 03:21 /etc/pihole/gravity.db *** [ DIAGNOSING ]: Info table property value -------------------- ---------------------------------------- version 15 updated 1643538072 gravity_count 2046 Last gravity run finished at: Sun 30 Jan 2022 03:21:12 AM MST ----- First 10 Gravity Domains ----- advanbusiness.com aoldaily.com aolon1ine.com applesoftupdate.com arrowservice.net attnpower.com aunewsonline.com avvmail.com bigdepression.net bigish.net *** [ DIAGNOSING ]: Groups id enabled name date_added date_modified description ---- ------- -------------------------------------------------- ------------------- ------------------- -------------------------------------------------- 0 1 Default 2022-01-30 01:54:48 2022-01-30 01:54:48 The default group *** [ DIAGNOSING ]: Domainlist (0/1 = exact white-/blacklist, 2/3 = regex white-/blacklist) *** [ DIAGNOSING ]: Clients *** [ DIAGNOSING ]: Adlists id enabled group_ids address date_added date_modified comment ----- ------- ------------ ---------------------------------------------------------------------------------------------------- ------------------- ------------------- -------------------------------------------------- 2 1 0 http://www.malwaredomainlist.com/hostslist/hosts.txt 2022-01-30 02:05:09 2022-01-30 02:05:09 *** [ DIAGNOSING ]: contents of /etc/pihole -rw-r--r-- 1 root root 0 Jan 30 01:54 /etc/pihole/custom.list -rw-r--r-- 1 root root 65 Jan 30 03:21 /etc/pihole/local.list -rw-r--r-- 1 root root 234 Jan 30 01:54 /etc/pihole/logrotate /var/log/pihole.log { su root root daily copytruncate rotate 5 compress delaycompress notifempty nomail } /var/log/pihole-FTL.log { su root root weekly copytruncate rotate 3 compress delaycompress notifempty nomail } -rw-rw-r-- 1 pihole root 127 Jan 30 01:54 /etc/pihole/pihole-FTL.conf PRIVACYLEVEL=0 
*** [ DIAGNOSING ]: contents of /etc/dnsmasq.d -rw-r--r-- 1 root root 1.4K Jan 30 02:14 /etc/dnsmasq.d/01-pihole.conf addn-hosts=/etc/pihole/local.list addn-hosts=/etc/pihole/custom.list localise-queries no-resolv cache-size=10000 log-queries log-facility=/var/log/pihole.log log-async server=127.0.0.1#5335 domain-needed expand-hosts bogus-priv local-service -rw-r--r-- 1 root root 38 Jan 30 02:14 /etc/dnsmasq.d/02-pivpn.conf addn-hosts=/etc/pivpn/hosts.wireguard -rw-r--r-- 1 root root 2.2K Jan 30 01:54 /etc/dnsmasq.d/06-rfc6761.conf server=/test/ server=/localhost/ server=/invalid/ server=/bind/ server=/onion/ *** [ DIAGNOSING ]: contents of /etc/lighttpd -rw-r--r-- 1 root root 0 Jan 30 01:54 /etc/lighttpd/external.conf -rw-r--r-- 1 root root 3.7K Jan 30 01:54 /etc/lighttpd/lighttpd.conf server.modules = ( "mod_access", "mod_accesslog", "mod_auth", "mod_expire", "mod_redirect", "mod_setenv", "mod_rewrite" ) server.document-root = "/var/www/html" server.error-handler-404 = "/pihole/index.php" server.upload-dirs = ( "/var/cache/lighttpd/uploads" ) server.errorlog = "/var/log/lighttpd/error.log" server.pid-file = "/run/lighttpd.pid" server.username = "www-data" server.groupname = "www-data" server.port = 80 accesslog.filename = "/var/log/lighttpd/access.log" accesslog.format = "%{%s}t|%V|%r|%s|%b" index-file.names = ( "index.php", "index.html", "index.lighttpd.html" ) url.access-deny = ( "~", ".inc", ".md", ".yml", ".ini" ) static-file.exclude-extensions = ( ".php", ".pl", ".fcgi" ) mimetype.assign = ( ".ico" => "image/x-icon", ".jpeg" => "image/jpeg", ".jpg" => "image/jpeg", ".png" => "image/png", ".svg" => "image/svg+xml", ".css" => "text/css; charset=utf-8", ".html" => "text/html; charset=utf-8", ".js" => "text/javascript; charset=utf-8", ".json" => "application/json; charset=utf-8", ".map" => "application/json; charset=utf-8", ".txt" => "text/plain; charset=utf-8", ".eot" => "application/vnd.ms-fontobject", ".otf" => "font/otf", ".ttc" => "font/collection", ".ttf" => "font/ttf", ".woff" => "font/woff", ".woff2" => "font/woff2" ) include_shell "cat external.conf 2>/dev/null" include_shell "/usr/share/lighttpd/use-ipv6.pl " + server.port include_shell "find /etc/lighttpd/conf-enabled -name '*.conf' -a ! 
-name 'letsencrypt.conf' -printf 'include \"%p\" ' 2>/dev/null" $HTTP["url"] =~ "^/admin/" { setenv.add-response-header = ( "X-Pi-hole" => "The Pi-hole Web interface is working!", "X-Frame-Options" => "DENY" ) } $HTTP["url"] =~ "^/admin/\.(.*)" { url.access-deny = ("") } $HTTP["url"] =~ "/(teleporter|api_token)\.php$" { $HTTP["referer"] =~ "/admin/settings\.php" { setenv.add-response-header = ( "X-Frame-Options" => "SAMEORIGIN" ) } } expire.url = ( "" => "access plus 0 seconds" ) *** [ DIAGNOSING ]: contents of /etc/cron.d -rw-r--r-- 1 root root 1.8K Jan 30 01:54 /etc/cron.d/pihole 21 3 * * 7 root PATH="$PATH:/usr/sbin:/usr/local/bin/" pihole updateGravity >/var/log/pihole_updateGravity.log || cat /var/log/pihole_updateGravity.log 00 00 * * * root PATH="$PATH:/usr/sbin:/usr/local/bin/" pihole flush once quiet #reboot root /usr/sbin/logrotate --state /var/lib/logrotate/pihole /etc/pihole/logrotate */10 * * * * root PATH="$PATH:/usr/sbin:/usr/local/bin/" pihole updatechecker local 34 16 * * * root PATH="$PATH:/usr/sbin:/usr/local/bin/" pihole updatechecker remote #reboot root PATH="$PATH:/usr/sbin:/usr/local/bin/" pihole updatechecker remote reboot *** [ DIAGNOSING ]: contents of /var/log/lighttpd -rw-r--r-- 1 www-data www-data 770 Jan 30 10:44 /var/log/lighttpd/error.log -----head of error.log------ 2022-01-30 01:53:27: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 01:54:35: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 01:54:35: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 01:59:39: Wrong token! Please re-login on the Pi-hole dashboard. 2022-01-30 02:06:15: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 02:06:53: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 02:17:36: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 02:18:02: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 09:17:22: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 10:44:02: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 10:44:25: server.c.1513) server started (lighttpd/1.4.59) -----tail of error.log------ 2022-01-30 01:53:27: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 01:54:35: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 01:54:35: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 01:59:39: Wrong token! Please re-login on the Pi-hole dashboard. 2022-01-30 02:06:15: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 02:06:53: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 02:17:36: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 02:18:02: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 09:17:22: server.c.1513) server started (lighttpd/1.4.59) 2022-01-30 10:44:02: server.c.1976) server stopped by UID = 0 PID = 1 2022-01-30 10:44:25: server.c.1513) server started (lighttpd/1.4.59) *** [ DIAGNOSING ]: contents of /var/log -rw-r--r-- 1 pihole pihole 55K Jan 30 11:00 /var/log/pihole-FTL.log -----head of pihole-FTL.log------ [2022-01-30 01:54:42.959 11980M] Using log file /var/log/pihole-FTL.log [2022-01-30 01:54:42.959 11980M] ########## FTL started on raspberrypi! 
########## [2022-01-30 01:54:42.959 11980M] FTL branch: master [2022-01-30 01:54:42.959 11980M] FTL version: v5.13 [2022-01-30 01:54:42.959 11980M] FTL commit: b197b69 [2022-01-30 01:54:42.959 11980M] FTL date: 2022-01-05 18:19:34 +0000 [2022-01-30 01:54:42.959 11980M] FTL user: pihole [2022-01-30 01:54:42.959 11980M] Compiled for armv7hf (compiled on CI) using arm-linux-gnueabihf-gcc (Debian 6.3.0-18) 6.3.0 20170516 [2022-01-30 01:54:42.959 11980M] Creating mutex [2022-01-30 01:54:42.959 11980M] Creating mutex [2022-01-30 01:54:42.961 11980M] Starting config file parsing (/etc/pihole/pihole-FTL.conf) [2022-01-30 01:54:42.961 11980M] SOCKET_LISTENING: only local [2022-01-30 01:54:42.961 11980M] AAAA_QUERY_ANALYSIS: Show AAAA queries [2022-01-30 01:54:42.961 11980M] MAXDBDAYS: max age for stored queries is 365 days [2022-01-30 01:54:42.961 11980M] RESOLVE_IPV6: Resolve IPv6 addresses [2022-01-30 01:54:42.961 11980M] RESOLVE_IPV4: Resolve IPv4 addresses [2022-01-30 01:54:42.962 11980M] DBINTERVAL: saving to DB file every minute [2022-01-30 01:54:42.962 11980M] DBFILE: Using /etc/pihole/pihole-FTL.db [2022-01-30 01:54:42.962 11980M] MAXLOGAGE: Importing up to 24.0 hours of log data [2022-01-30 01:54:42.962 11980M] PRIVACYLEVEL: Set to 0 [2022-01-30 01:54:42.962 11980M] IGNORE_LOCALHOST: Show queries from localhost [2022-01-30 01:54:42.962 11980M] BLOCKINGMODE: Null IPs for blocked domains [2022-01-30 01:54:42.962 11980M] ANALYZE_ONLY_A_AND_AAAA: Disabled. Analyzing all queries [2022-01-30 01:54:42.962 11980M] DBIMPORT: Importing history from database [2022-01-30 01:54:42.962 11980M] PIDFILE: Using /run/pihole-FTL.pid [2022-01-30 01:54:42.962 11980M] PORTFILE: Using /run/pihole-FTL.port [2022-01-30 01:54:42.962 11980M] SOCKETFILE: Using /run/pihole/FTL.sock [2022-01-30 01:54:42.962 11980M] SETUPVARSFILE: Using /etc/pihole/setupVars.conf [2022-01-30 01:54:42.962 11980M] MACVENDORDB: Using /etc/pihole/macvendor.db [2022-01-30 01:54:42.962 11980M] GRAVITYDB: Using /etc/pihole/gravity.db [2022-01-30 01:54:42.962 11980M] PARSE_ARP_CACHE: Active [2022-01-30 01:54:42.962 11980M] CNAME_DEEP_INSPECT: Active [2022-01-30 01:54:42.963 11980M] DELAY_STARTUP: No delay requested. 
[2022-01-30 01:54:42.963 11980M] BLOCK_ESNI: Enabled, blocking _esni.{blocked domain} [2022-01-30 01:54:42.963 11980M] NICE: Set process niceness to -10 (default) -----tail of pihole-FTL.log------ [2022-01-30 10:44:26.702 738M] ADDR2LINE: Enabled [2022-01-30 10:44:26.702 738M] REPLY_WHEN_BUSY: Permit queries when the database is busy [2022-01-30 10:44:26.702 738M] BLOCK_TTL: 2 seconds [2022-01-30 10:44:26.702 738M] BLOCK_ICLOUD_PR: Enabled [2022-01-30 10:44:26.702 738M] CHECK_LOAD: Enabled [2022-01-30 10:44:26.702 738M] CHECK_SHMEM: Warning if shared-memory usage exceeds 90% [2022-01-30 10:44:26.702 738M] CHECK_DISK: Warning if certain disk usage exceeds 90% [2022-01-30 10:44:26.702 738M] Finished config file parsing [2022-01-30 10:44:26.707 738M] Database version is 9 [2022-01-30 10:44:26.708 738M] Resizing "FTL-strings" from 40960 to (81920 * 1) == 81920 (/dev/shm: 1.1MB used, 483.8MB total, FTL uses 1.1MB) [2022-01-30 10:44:26.710 738M] Imported 0 alias-clients [2022-01-30 10:44:26.710 738M] Database successfully initialized [2022-01-30 10:44:27.558 738M] New upstream server: 127.0.0.1:5335 (0/256) [2022-01-30 10:44:27.570 738M] Imported 207 queries from the long-term database [2022-01-30 10:44:27.571 738M] -> Total DNS queries: 207 [2022-01-30 10:44:27.571 738M] -> Cached DNS queries: 67 [2022-01-30 10:44:27.571 738M] -> Forwarded DNS queries: 140 [2022-01-30 10:44:27.571 738M] -> Blocked DNS queries: 0 [2022-01-30 10:44:27.571 738M] -> Unknown DNS queries: 0 [2022-01-30 10:44:27.571 738M] -> Unique domains: 44 [2022-01-30 10:44:27.571 738M] -> Unique clients: 5 [2022-01-30 10:44:27.572 738M] -> Known forward destinations: 1 [2022-01-30 10:44:27.572 738M] Successfully accessed setupVars.conf [2022-01-30 10:44:27.579 738M] listening on 0.0.0.0 port 53 [2022-01-30 10:44:27.579 738M] listening on :: port 53 [2022-01-30 10:44:27.586 741M] PID of FTL process: 741 [2022-01-30 10:44:27.588 741/T742] Listening on port 4711 for incoming IPv4 telnet connections [2022-01-30 10:44:27.589 741M] INFO: FTL is running as user pihole (UID 999) [2022-01-30 10:44:27.589 741/T744] Listening on Unix socket [2022-01-30 10:44:27.591 741/T743] Listening on port 4711 for incoming IPv6 telnet connections [2022-01-30 10:44:27.603 741M] Reloading DNS cache [2022-01-30 10:44:28.601 741/T745] Compiled 0 whitelist and 0 blacklist regex filters for 5 clients in 2.7 msec [2022-01-30 10:44:29.597 741M] Blocking status is enabled [2022-01-30 11:00:01.881 741/T747] SQLite3 message: database is locked in "SELECT name FROM network_addresses WHERE name IS NOT NULL AND ip = ?;" (5) [2022-01-30 11:00:01.881 741/T747] getNameFromIP("192.168.0.128") - SQL error prepare: database is locked *** [ DIAGNOSING ]: contents of /dev/shm -rw------- 1 pihole pihole 668K Jan 30 11:31 /dev/shm/FTL-clients -rw------- 1 pihole pihole 240 Jan 30 10:44 /dev/shm/FTL-counters -rw------- 1 pihole pihole 4.0K Jan 30 10:44 /dev/shm/FTL-dns-cache -rw------- 1 pihole pihole 4.0K Jan 30 10:44 /dev/shm/FTL-domains -rw------- 1 pihole pihole 56 Jan 30 10:44 /dev/shm/FTL-lock -rw------- 1 pihole pihole 12K Jan 30 10:44 /dev/shm/FTL-overTime -rw------- 1 pihole pihole 4.0K Jan 30 10:44 /dev/shm/FTL-per-client-regex -rw------- 1 pihole pihole 176K Jan 30 10:44 /dev/shm/FTL-queries -rw------- 1 pihole pihole 12 Jan 30 10:44 /dev/shm/FTL-settings -rw------- 1 pihole pihole 80K Jan 30 10:44 /dev/shm/FTL-strings -rw------- 1 pihole pihole 156K Jan 30 10:44 /dev/shm/FTL-upstreams *** [ DIAGNOSING ]: contents of /etc -rw-r--r-- 1 root root 24 Jan 30 01:54 
/etc/dnsmasq.conf conf-dir=/etc/dnsmasq.d -rw-r--r-- 1 root root 47 Jan 30 10:44 /etc/resolv.conf nameserver 127.0.0.1 *** [ DIAGNOSING ]: Pi-hole diagnosis messages *** [ DIAGNOSING ]: Locale LANG=en_US.UTF-8 *** [ DIAGNOSING ]: Pi-hole log -rw-r--r-- 1 pihole pihole 88K Jan 30 11:31 /var/log/pihole.log -----head of pihole.log------ Jan 30 01:54:48 dnsmasq[11982]: started, version pi-hole-2.87test4-18 cachesize 10000 Jan 30 01:54:48 dnsmasq[11982]: DNS service limited to local subnets Jan 30 01:54:48 dnsmasq[11982]: compile time options: IPv6 GNU-getopt no-DBus no-UBus no-i18n IDN DHCP DHCPv6 Lua TFTP no-conntrack ipset no-nftset auth cryptohash DNSSEC loop-detect inotify dumpfile Jan 30 01:54:48 dnsmasq[11982]: using nameserver 127.0.0.1#5335 Jan 30 01:54:48 dnsmasq[11982]: using nameserver 127.0.0.1#5335 Jan 30 01:54:48 dnsmasq[11982]: using only locally-known addresses for onion Jan 30 01:54:48 dnsmasq[11982]: using only locally-known addresses for bind Jan 30 01:54:48 dnsmasq[11982]: using only locally-known addresses for invalid Jan 30 01:54:48 dnsmasq[11982]: using only locally-known addresses for localhost Jan 30 01:54:48 dnsmasq[11982]: using only locally-known addresses for test Jan 30 01:54:48 dnsmasq[11982]: read /etc/hosts - 5 addresses Jan 30 01:54:48 dnsmasq[11982]: read /etc/pihole/custom.list - 0 addresses Jan 30 01:54:48 dnsmasq[11982]: failed to load names from /etc/pihole/local.list: No such file or directory Jan 30 02:00:04 dnsmasq[11982]: exiting on receipt of SIGTERM Jan 30 02:00:07 dnsmasq[13277]: started, version pi-hole-2.87test4-18 cachesize 10000 Jan 30 02:00:07 dnsmasq[13277]: DNS service limited to local subnets Jan 30 02:00:07 dnsmasq[13277]: compile time options: IPv6 GNU-getopt no-DBus no-UBus no-i18n IDN DHCP DHCPv6 Lua TFTP no-conntrack ipset no-nftset auth cryptohash DNSSEC loop-detect inotify dumpfile Jan 30 02:00:07 dnsmasq[13277]: using nameserver 127.0.0.1#5335 Jan 30 02:00:07 dnsmasq[13277]: using only locally-known addresses for onion Jan 30 02:00:07 dnsmasq[13277]: using only locally-known addresses for bind -----tail of pihole.log------ Jan 30 11:31:20 dnsmasq[741]: query[AAAA] ns1.pi-hole.net from 127.0.0.1 Jan 30 11:31:20 dnsmasq[741]: forwarded ns1.pi-hole.net to 127.0.0.1 Jan 30 11:31:20 dnsmasq[741]: reply ns1.pi-hole.net is 205.251.193.151 Jan 30 11:31:20 dnsmasq[741]: reply ns1.pi-hole.net is 2600:9000:5301:9700::1 Jan 30 11:31:22 dnsmasq[741]: query[A] mail.chileexe77.com from 127.0.0.1 Jan 30 11:31:22 dnsmasq[741]: gravity blocked mail.chileexe77.com is 0.0.0.0 Jan 30 11:31:22 dnsmasq[741]: query[A] mail.chileexe77.com from 192.168.0.155 Jan 30 11:31:22 dnsmasq[741]: gravity blocked mail.chileexe77.com is 0.0.0.0 Jan 30 11:31:22 dnsmasq[741]: query[A] mail.chileexe77.com from 10.6.0.1 Jan 30 11:31:22 dnsmasq[741]: gravity blocked mail.chileexe77.com is 0.0.0.0 Jan 30 11:31:22 dnsmasq[741]: query[PTR] 155.0.168.192.in-addr.arpa from 127.0.0.1 Jan 30 11:31:23 dnsmasq[741]: config 155.0.168.192.in-addr.arpa is <PTR> Jan 30 11:31:23 dnsmasq[741]: query[PTR] 1.0.6.10.in-addr.arpa from 127.0.0.1 Jan 30 11:31:23 dnsmasq[741]: config 1.0.6.10.in-addr.arpa is <PTR> Jan 30 11:31:23 dnsmasq[741]: query[AAAA] file.firefoxupdata.com from ::1 Jan 30 11:31:23 dnsmasq[741]: gravity blocked file.firefoxupdata.com is :: Jan 30 11:31:23 dnsmasq[741]: query[AAAA] file.firefoxupdata.com from fe80::a7c:c1a2:460f:f20b Jan 30 11:31:23 dnsmasq[741]: gravity blocked file.firefoxupdata.com is :: Jan 30 11:31:24 dnsmasq[741]: query[PTR] 
b.0.2.f.f.0.6.4.2.a.1.c.c.7.a.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa from 127.0.0.1 Jan 30 11:31:24 dnsmasq[741]: config b.0.2.f.f.0.6.4.2.a.1.c.c.7.a.0.0.0.0.0.0.0.0.0.0.0.0.0.0.8.e.f.ip6.arpa is <PTR> ******************************************** ******************************************** [✓] ** FINISHED DEBUGGING! **
Paragraph execution in Zeppelin goes to a pending state after some time and the Hadoop application status for Zeppelin is finished
I am using zeppelin from the last 3 months and noticed this strange problem recently. Everyday morning I had to restart zeppelin for it to work or else the paragraph execution will go to pending state and never run. I tried to dig deeper to check what is the problem. The state of the zeppelin application in yarn is finshed. I tried to check the log and it shows the below error. Couldn't make out anything out of it. 2017-06-28 22:04:08,986 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 56876 for container-id container_1498627544571_0001_01_000002: 1.2 GB of 4 GB physical memory used; 4.0 GB of 20 GB virtual memory used 2017-06-28 22:04:08,995 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 56787 for container-id container_1498627544571_0001_01_000001: 330.2 MB of 1 GB physical memory used; 1.4 GB of 5 GB virtual memory used 2017-06-28 22:04:09,964 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1498627544571_0001_01_000002 is : 1 2017-06-28 22:04:09,965 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1498627544571_0001_01_000002 and exit code: 1 ExitCodeException exitCode=1: at org.apache.hadoop.util.Shell.runCommand(Shell.java:545) at org.apache.hadoop.util.Shell.run(Shell.java:456) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch. 2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1498627544571_0001_01_000002 2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1 I am the only user in that environment and no one else is using it. There isn't any process running at that time as well. Couldn't understand why it is happening.
Flink Streaming Job Fails Automatically
I am running a flink streaming job with parallelism 1 . Suddenly after 8 hours job failed . It showed Association with remote system [akka.tcp://flink#192.168.3.153:44863] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 2017-04-12 00:48:36,683 INFO org.apache.flink.yarn.YarnJobManager - Container container_e35_1491556562442_5086_01_000002 is completed with diagnostics: Container [pid=64750,containerID=container_e35_1491556562442_5086_01_000002] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 2.9 GB of 4.2 GB virtual memory used. Killing container. Dump of the process-tree for container_e35_1491556562442_5086_01_000002 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 64750 64748 64750 64750 (bash) 0 0 108654592 306 /bin/bash -c /usr/java/jdk1.7.0_67-cloudera/bin/java -Xms724m -Xmx724m -XX:MaxDirectMemorySize=1448m -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/ -Dlog.file=/var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> /var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.out 2> /var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.err |- 64756 64750 64750 64750 (java) 269053 57593 2961149952 524252 /usr/java/jdk1.7.0_67-cloudera/bin/java -Xms724m -Xmx724m -XX:MaxDirectMemorySize=1448m -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/ -Dlog.file=/var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143 There are no application/code side error. Need help to understand what could be the cause ?
The job is killed because it exceeds the memory limits set in YARN. See this part of your error message: Container [pid=64750,containerID=container_e35_1491556562442_5086_01_000002] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 2.9 GB of 4.2 GB virtual memory used. Killing container.
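If the job really needs that much memory, one option is simply to request bigger TaskManager containers from YARN so heap plus off-heap usage stays under the container limit. A sketch for the Flink-on-YARN CLI of that era (your-streaming-job.jar is a placeholder; flag names can differ between Flink versions):
# 1 TaskManager with 4 GB instead of 2 GB, 1 GB JobManager
$ flink run -m yarn-cluster -yn 1 -ytm 4096 -yjm 1024 your-streaming-job.jar
Alternatively, reduce the memory the job itself uses (state size, managed memory fraction, direct buffers) so it fits inside the existing container.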
Restart an MPI slave after a checkpoint before failure on ARMv6
UPDATE I have an university project in which I should build up a cluster with RPis. Now we have a fully functional system with BLCR/MPICH on. BLCR works very well with normal processes linked with the lib. Demonstrations we have to show from our management web interface are: parallel execution of a job migration of processes across the nodes fault tolerance with MPI We are allowed to use the simplest computations. The first one we got easily, with MPI too. The second point we actually have only working with normal processes (without MPI). Regarding the third point I have less idea how to implement a master-slave MPI scheme, in which I can restart a slave process, which also affects point two because we should/can/have_to make a checkpoint of the slave process, kill/stop it and restart it on another node. I know that I have to handle the MPI_Errors myself but how to restore the process? It would be nice if someone could post me a link or paper (with explanations) at least. Thanks in advance UPDATE: As written earlier our BLCR+MPICH stuff works or seems to. But... When I start MPI Processes checkpointing seems to work well. Here the proof: ... snip ... Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... DONE Many salts: 767744 c/s real, 767744 c/s virtual Only one salt: 560896 c/s real, 560896 c/s virtual Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... [proxy:0:0#node2] requesting checkpoint [proxy:0:0#node2] checkpoint completed [proxy:0:1#node1] requesting checkpoint [proxy:0:1#node1] checkpoint completed [proxy:0:2#node3] requesting checkpoint [proxy:0:2#node3] checkpoint completed ... snip ... If I kill one Slave-Process on any node I get this here: ... snip ... =================================================================================== = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES = EXIT CODE: 9 = CLEANING UP REMAINING PROCESSES = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES =================================================================================== ... snip ... It is ok because we have a checkpoint so we can restart our application. But it doesn't work: pi 7380 0.0 0.2 2984 1012 pts/4 S+ 16:38 0:00 mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3 pi 7381 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.101 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0 pi 7382 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.102 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1 pi 7383 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.105 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2 pi 7438 0.0 0.1 3548 868 pts/1 S+ 16:40 0:00 grep --color=auto mpi I don't know why but the first time I restart the app on every node the process seems to be restarted (I got it from using top or ps aux | grep "john" but no output to the management (or on the management console/terminal) is shown. It just hangs up after showing me: mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3 Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts. 
Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts. Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts. My plan B is just to test with own application if the BLCR/MPICH stuff really works. Maybe there some troubles with john. Thanks in advance ** UPDATE ** Next problem with simple hello world. I dispair slowly. Maybe I'm confused too much. mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 3 -f /tmp/machinefile -n 4 ./hello Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts. Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts. Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts. [proxy:0:0#node2] requesting checkpoint [proxy:0:0#node2] checkpoint completed [proxy:0:1#node1] requesting checkpoint [proxy:0:1#node1] checkpoint completed [proxy:0:2#node3] requesting checkpoint [proxy:0:2#node3] checkpoint completed [proxy:0:0#node2] requesting checkpoint [proxy:0:0#node2] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0#node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed [proxy:0:0#node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status [proxy:0:0#node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event [proxy:0:1#node1] requesting checkpoint [proxy:0:1#node1] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:1#node1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed [proxy:0:1#node1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status [proxy:0:1#node1] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event [proxy:0:2#node3] requesting checkpoint [proxy:0:2#node3] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:2#node3] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed [proxy:0:2#node3] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status [proxy:0:2#node3] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event [mpiexec#masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed [mpiexec#masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status [mpiexec#masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event [mpiexec#masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion hello.c /* C Example */ #include <stdio.h> #include <mpi.h> int main (argc, argv) int argc; char *argv[]; { int rank, size, i, j; char hostname[1024]; hostname[1023] = '\0'; gethostname(hostname, 1023); MPI_Init (&argc, &argv); /* starts MPI */ MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */ MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */ i = 0; for(i ; i < 400000000; i++){ for(j; j < 4000000; j++){ } } printf("%s done...", hostname); printf("%s: %d is alive\n", hostname, getpid()); MPI_Finalize(); return 0; }