Flink Streaming Job Fails Automatically - apache-flink

I am running a Flink streaming job with parallelism 1.
Suddenly, after 8 hours, the job failed. It showed:
Association with remote system [akka.tcp://flink@192.168.3.153:44863] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2017-04-12 00:48:36,683 INFO org.apache.flink.yarn.YarnJobManager - Container container_e35_1491556562442_5086_01_000002 is completed with diagnostics: Container [pid=64750,containerID=container_e35_1491556562442_5086_01_000002] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 2.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_e35_1491556562442_5086_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 64750 64748 64750 64750 (bash) 0 0 108654592 306 /bin/bash -c /usr/java/jdk1.7.0_67-cloudera/bin/java -Xms724m -Xmx724m -XX:MaxDirectMemorySize=1448m -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/ -Dlog.file=/var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> /var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.out 2> /var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.err
|- 64756 64750 64750 64750 (java) 269053 57593 2961149952 524252 /usr/java/jdk1.7.0_67-cloudera/bin/java -Xms724m -Xmx724m -XX:MaxDirectMemorySize=1448m -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/ -Dlog.file=/var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir .
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
There are no application/code-side errors.
I need help understanding what the cause could be.

The job was killed because it exceeded the memory limits set in YARN.
See this part of your error message:
Container [pid=64750,containerID=container_e35_1491556562442_5086_01_000002] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 2.9 GB of 4.2 GB virtual memory used. Killing container.
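The TaskManager JVM (heap plus direct memory and other off-heap allocations) grew past the 2 GB YARN container it was launched in. Two hedged ways to address this, assuming a Flink 1.2/1.3-era YARN deployment (flag and key names vary by version, and your-job.jar is a placeholder):
./bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 4096 your-job.jar
and/or give the JVM more off-heap headroom inside each container via flink-conf.yaml (yarn.heap-cutoff-ratio before 1.3, containerized.heap-cutoff-ratio from 1.3 on):
yarn.heap-cutoff-ratio: 0.3
The cutoff ratio is the fraction of the YARN container that Flink keeps out of the JVM heap as headroom for direct buffers, metaspace, and other native allocations.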

Related

How to restrict mongodb's disk usage on a 32-bit architecture system (mongodb version 2.4)?

I am using mongodb version 2.4 due to a 32-bit system limitation (NanoPi M1 Plus). I have a Debian Jessie OS image (Debian 8) with 4.2 GB of space available (eMMC). I have a total of about 2.2 GB available after loading my application files. However, my flash fills up to 100% shortly after I start running my application.
Then I get the error "Unable to get database instance and mongodb stopped working" and my application stops working.
Can someone please help me with this problem? Thanks in advance!
Disk status of my device when it stopped working:
df -h:
Filesystem   Size  Used  Avail  Use%  Mounted on
overlay      4.2G  4.2G      0  100%  /
du -shx /var/lib/mongodb/* | sort -rh | head -n 20
512M /var/lib/mongodb/xyz.6
512M /var/lib/mongodb/xyz.5
257M /var/lib/mongodb/xyz.4
128M /var/lib/mongodb/xyz.3
64M /var/lib/mongodb/xyz.2
32M /var/lib/mongodb/xyz.1
17M /var/lib/mongodb/xyz.ns
17M /var/lib/mongodb/xyz.0
16M /var/lib/mongodb/local.ns
16M /var/lib/mongodb/local.0
4.0K /var/lib/mongodb/journal
0 /var/lib/mongodb/mongodb.lock
du -shx /var/lib/mongodb/journal/* | sort -rh | head -n 20
257M /var/lib/mongodb/journal/prealloc.2
257M /var/lib/mongodb/journal/prealloc.1
257M /var/lib/mongodb/journal/prealloc.0
du -shx /var/log/mongodb/journal/* | sort -rh | head -n 20
399M /var/lib/mongodb/mongodb.log
353M /var/lib/mongodb/mongodb.log.1
3.3M /var/lib/mongodb/mongodb.log.2.gz
752K /var/lib/mongodb/mongodb.log.1.gz
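For the mongodb 2.4 / MMAPv1 storage engine, the usual disk-usage knobs live in /etc/mongodb.conf (INI style in 2.4). A hedged sketch; the quota value is illustrative:
smallfiles = true     # cap data files at 512 MB and shrink preallocation
noprealloc = true     # do not preallocate the next data file in advance
nojournal = true      # removes the journal and its ~256 MB prealloc files (trades away crash durability)
quota = true          # enforce a per-database data-file quota
quotaFiles = 4        # maximum number of data files per database
logappend = true      # keep appending to one log file; rotate mongodb.log via logrotate
Note that the two mongodb.log files above account for roughly 750 MB on their own, so log rotation alone would recover a large share of the flash.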

Paragraph execution in Zeppelin goes to pending state after some time and the Hadoop application status for Zeppelin is FINISHED

I have been using Zeppelin for the last 3 months and noticed this strange problem recently. Every morning I have to restart Zeppelin for it to work, or else paragraph execution goes to the pending state and never runs. I tried to dig deeper to find the problem. The state of the Zeppelin application in YARN is FINISHED. I checked the log and it shows the error below; I couldn't make anything of it.
2017-06-28 22:04:08,986 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 56876 for container-id container_1498627544571_0001_01_000002: 1.2 GB of 4 GB physical memory used; 4.0 GB of 20 GB virtual memory used
2017-06-28 22:04:08,995 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 56787 for container-id container_1498627544571_0001_01_000001: 330.2 MB of 1 GB physical memory used; 1.4 GB of 5 GB virtual memory used
2017-06-28 22:04:09,964 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1498627544571_0001_01_000002 is : 1
2017-06-28 22:04:09,965 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1498627544571_0001_01_000002 and exit code: 1
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1498627544571_0001_01_000002
2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
I am the only user in that environment and no one else is using it. There isn't any other process running at that time either. I couldn't understand why this is happening.
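One hedged next step, assuming YARN log aggregation is enabled: pull the aggregated logs for the finished application, which include the interpreter container's stdout/stderr rather than only the NodeManager's view:
yarn logs -applicationId application_1498627544571_0001 | less
The exit code 1 above only says the container's launch failed; the container's own stderr usually names the actual exception.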

How to limit the maximum memory a process can use in CentOS?

I want to limit the maximum memory a process can use in CentOS. There can be scenarios where a process ends up using all or most of the available memory, affecting other processes on the system. Therefore, I want to know how this can be limited.
Also, if you can give a sample program where you limit the memory usage of a process and show the following scenarios, that would be helpful:
Memory allocation succeeds when the requested memory is within the set limit.
Memory allocation fails when the requested memory is above the set limit.
-Thanks
ulimit can be used to limit memory utilization (among other things).
Here is an example of setting memory usage so low that /bin/ls (which is larger than /bin/cat) no longer works, but /bin/cat still works.
$ ls -lh /bin/ls /bin/cat
-rwxr-xr-x 1 root root 25K May 24 2008 /bin/cat
-rwxr-xr-x 1 root root 88K May 24 2008 /bin/ls
$ date > date.txt
$ ulimit -d 10000 -m 10000 -v 10000
$ /bin/ls date.txt
/bin/ls: error while loading shared libraries: libc.so.6: failed to map segment from shared object: Cannot allocate memory
$ /bin/cat date.txt
Thu Mar 26 11:51:16 PDT 2009
$
Note: If I set the limits to 1000 kilobytes, neither program works, because the shared libraries they load push their memory use above 1000 KB.
-d data segment size
-m max memory size
-v virtual memory size
Run ulimit -a to see all the resource caps ulimit can set.
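For the requested sample program, here is a minimal sketch using the same limits through Python's resource module (the 1 GB cap and allocation sizes are illustrative):
import resource

# cap this process's virtual address space at 1 GB (soft and hard limits)
limit = 1024 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

try:
    ok = bytearray(64 * 1024 * 1024)  # 64 MB: within the limit
    print("64 MB allocation succeeded")
except MemoryError:
    print("64 MB allocation failed")

try:
    too_big = bytearray(2 * 1024 * 1024 * 1024)  # 2 GB: above the limit
    print("2 GB allocation succeeded")
except MemoryError:
    print("2 GB allocation failed: limit enforced")
setrlimit is the system call behind ulimit, so the same caps apply to any process, including children spawned by a shell that has run ulimit.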

How to find which process is leaking file handles in Linux?

The problem incident:
Our production system started denying services with the error message "Too many open files in system". Most services were affected: we could not start a new ssh session or even log in on a virtual console from the physical terminal. Luckily, one root ssh session was open, so we could interact with the system (moral: always keep one root session open!). As a side effect, some services (named, dbus-daemon, rsyslogd, avahi-daemon) saturated the CPU (100% load). The system also serves a large directory via NFS to a very busy client, which was backing up 50,000 small files at the time. Restarting all kinds of services and programs normalized their CPU behavior, but did not solve the "Too many open files in system" problem.
The suspected cause
Most likely, some program is leaking file handles. The culprit is probably my Tcl program, which also saturated the CPU (not normal). However, killing it did not help, and, most disturbingly, lsof would not reveal large numbers of open files.
Some evidence
We had to reboot, so whatever information was collected is all we have.
root@xeon:~# cat /proc/sys/fs/file-max
205900
root@xeon:~# lsof
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init 1 root cwd DIR 8,6 4096 2 /
init 1 root rtd DIR 8,6 4096 2 /
init 1 root txt REG 8,6 124704 7979050 /sbin/init
init 1 root mem REG 8,6 42580 5357606 /lib/i386-linux-gnu/libnss_files-2.13.so
init 1 root mem REG 8,6 243400 5357572 /lib/i386-linux-gnu/libdbus-1.so.3.5.4
...
A pretty normal list: definitely not 200K files, more like two hundred.
This is probably where the problem started:
less /var/log/syslog
Mar 27 06:54:01 xeon CRON[16084]: (CRON) error (grandchild #16090 failed with exit status 1)
Mar 27 06:54:21 xeon kernel: [8848865.426732] VFS: file-max limit 205900 reached
Mar 27 06:54:29 xeon postfix/master[1435]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system
Mar 27 06:54:29 xeon kernel: [8848873.611491] VFS: file-max limit 205900 reached
Mar 27 06:54:32 xeon kernel: [8848876.293525] VFS: file-max limit 205900 reached
netstat did not show noticeable anomalies either.
The man pages for ps and top do not indicate an ability to show open file count. Probably the problem will repeat itself after a few months (that was our uptime).
Any ideas on what else can be done to identify the open files?
UPDATE
The meaning of this question has changed since qehgt identified the likely cause.
Apart from the bug in NFS v4 code, I suspect there is a design limitation in Linux and kernel-leaked file handles can NOT be identified. Consequently, the original question transforms into:
"Who is responsible for file handles in the Linux kernel?" and "Where do I post that question?". The 1st answer was helpful, but I am willing to accept a better answer.
Probably the root cause is a bug in NFSv4 implementation: https://stackoverflow.com/a/5205459/280758
They have similar symptoms.
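For next time, a hedged way to check whether user-space processes account for the allocated handles (kernel-held handles, like the ones leaked by the NFSv4 bug above, will not show up in this count):
# sum open file descriptors per process straight from /proc
for p in /proc/[0-9]*; do
  echo "$(ls "$p/fd" 2>/dev/null | wc -l) ${p#/proc/} $(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -20
# compare against the kernel's allocated / free / max counters
cat /proc/sys/fs/file-nr
If the per-process totals stay far below the first field of file-nr, the handles are held inside the kernel, consistent with the NFSv4 explanation.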

Sybase initializes but does not run

I am using Red Hat 5.5 and I am trying to run Sybase ASE 12.5.4.
Yesterday I was trying to use the command "service sybase start", and the console showed Sybase repeatedly trying, and failing, to initialize the main database server.
UPDATE:
I initialized a database at /ims_systemdb/master using the following commands:
dataserver -d /ims_systemdb/master -z 2k -b 51204 -c $SYBASE/ims.cfg -e db_error.log
chmod a=rwx /ims_systemdb/master
ls -al /ims_systemdb/master
And it gives me a nice database at /ims_systemdb/master with a size of 104865792 bytes (2048 × 51204).
But when I run
service sybase start
The error log at /logs/sybase_error.log goes like this:
00:00000:00000:2013/04/26 16:11:45.18 kernel Using config area from primary master device.
00:00000:00000:2013/04/26 16:11:45.19 kernel Detected 1 physical CPU
00:00000:00000:2013/04/26 16:11:45.19 kernel os_create_region: can't allocate 11534336000 bytes
00:00000:00000:2013/04/26 16:11:45.19 kernel kbcreate: couldn't create kernel region.
00:00000:00000:2013/04/26 16:11:45.19 kernel kistartup: could not create shared memory
I read "os_create_region" is normal if you don't set shmmax in sysctl high enough, so I set it to 16000000000000, but I still get this error. And sometimes, when I'm playing around with the .cfg file, I get this error message instead:
00:00000:00000:2013/04/25 14:04:08.28 kernel Using config area from primary master device.
00:00000:00000:2013/04/25 14:04:08.29 kernel Detected 1 physical CPU
00:00000:00000:2013/04/25 14:04:08.85 server The size of each partitioned pool must have atleast 512K. With the '16' partitions we cannot configure this value f
Why do these two errors appear and what can I do about them?
UPDATE:
Currently, I'm seeing the 1st error message (os cannot allocate bytes). The contents of /etc/sysctl.conf are as follows:
kernel.shmmax = 4294967295
kernel.shmall = 1048576
kernel.shmmni = 4096
But the log statements earlier state that
os_create_region: can't allocate 11534336000 bytes
So why is the region it is trying to allocate so big, and where did that get set?
The Solution:
When you get a message like "os_create_region: can't allocate 11534336000 bytes", it means that Sybase's configuration file is asking the kernel to create a shared-memory region that exceeds the shmmax variable in /etc/sysctl.conf.
The main thing to do is to edit ims.cfg (or whatever configuration file you are using) and lower the max memory variable in the [Physical Memory] section:
[Physical Memory]
max memory = 64000
additional network memory = 10485760
shared memory starting address = DEFAULT
allocate max shared memory = 1
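As a cross-check on where the 11534336000 came from (hedged: ASE memory parameters are counted in 2 KB units): 11,534,336,000 / 2048 = 5,632,000, so the original ims.cfg most likely carried max memory = 5632000. With max memory = 64000 the request shrinks to 64000 × 2048 = 131,072,000 bytes (about 125 MB), comfortably below kernel.shmmax.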
For your information, my /etc/sysctl.conf file ended with these three lines:
kernel.shmmax = 16000000000
kernel.shmall = 16000000000
kernel.shmmni = 8192
And once this is done, type "showserver" to reveal what processes are running.
For more information, consult the Sybase System Administrator's Guide, volume 2 as well as Michael Gardner's link to Red Hat memory management in the comments earlier.
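After editing /etc/sysctl.conf, the new shared-memory limits can be applied without a reboot (standard Linux commands):
sysctl -p                     # reload /etc/sysctl.conf
cat /proc/sys/kernel/shmmax   # verify the new ceiling
ipcs -m                       # list the shared memory segments currently allocated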
