Paragraph execution in Zeppelin goes to pending state after some time and the Hadoop application status for Zeppelin is FINISHED - apache-zeppelin

I have been using Zeppelin for the last 3 months and noticed this strange problem recently. Every morning I have to restart Zeppelin for it to work; otherwise paragraph execution goes to the pending state and never runs. I tried to dig deeper to find out what the problem is. The state of the Zeppelin application in YARN is FINISHED. I checked the log and it shows the error below, but I couldn't make anything out of it.
2017-06-28 22:04:08,986 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 56876 for container-id container_1498627544571_0001_01_000002: 1.2 GB of 4 GB physical memory used; 4.0 GB of 20 GB virtual memory used
2017-06-28 22:04:08,995 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 56787 for container-id container_1498627544571_0001_01_000001: 330.2 MB of 1 GB physical memory used; 1.4 GB of 5 GB virtual memory used
2017-06-28 22:04:09,964 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1498627544571_0001_01_000002 is : 1
2017-06-28 22:04:09,965 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1498627544571_0001_01_000002 and exit code: 1
ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1498627544571_0001_01_000002
2017-06-28 22:04:09,972 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
I am the only user of that environment, and there is no other process running at that time either. I can't understand why this is happening.
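For anyone hitting the same wall: exit code 1 only says the interpreter container died; the underlying error is usually in that container's own stdout/stderr. A minimal sketch for pulling it out of YARN once the application shows as FINISHED, assuming log aggregation is enabled (the application id below is derived from the container id in the log above):
# Fetch the aggregated logs for the Zeppelin interpreter's YARN application
yarn logs -applicationId application_1498627544571_0001 > zeppelin_app.log
# The interpreter's stderr near the bottom usually names the real failure
less zeppelin_app.log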

Related

vsphere-csi-controller fails to start due to "invalid memory address or nil pointer dereference"

I am installing k8s and vSphere CPI/CSI following the instructions located here.
My setup:
2x CentOS 7.7 vSphere VMs (50 GB disk / 16 GB RAM), 1 master & 1 node in the k8s cluster.
I made it to the part where I create the storageClass (near the end) when I ran into exactly this GitHub issue. The OP of the linked issue just started from scratch and their issue went away, so the report was closed. That has not been the case for me: I've redeployed my k8s cluster from scratch a bunch of times now and always hit this wall. Below is the error if you don't want to check the linked GitHub issue.
Does anyone have ideas on what I can try to get past this? I've checked my disk and RAM and there is plenty of both.
# kubectl -n kube-system logs pod/vsphere-csi-controller-0 vsphere-csi-controller
I0127 18:49:43.292667 1 config.go:261] GetCnsconfig called with cfgPath: /etc/cloud/csi-vsphere.conf
I0127 18:49:43.292859 1 config.go:206] Initializing vc server 132.250.31.180
I0127 18:49:43.292867 1 controller.go:67] Initializing CNS controller
I0127 18:49:43.292884 1 virtualcentermanager.go:63] Initializing defaultVirtualCenterManager...
I0127 18:49:43.292892 1 virtualcentermanager.go:65] Successfully initialized defaultVirtualCenterManager
I0127 18:49:43.292905 1 virtualcentermanager.go:107] Successfully registered VC "132.250.31.180"
I0127 18:49:43.292913 1 manager.go:60] Initializing volume.volumeManager...
I0127 18:49:43.292917 1 manager.go:64] volume.volumeManager initialized
time="2020-01-27T18:50:03Z" level=info msg="received signal; shutting down" signal=terminated
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x128 pc=0x867dc7]
goroutine 10 [running]:
google.golang.org/grpc.(*Server).GracefulStop(0x0)
/go/pkg/mod/google.golang.org/grpc@v1.23.0/server.go:1393 +0x37
github.com/rexray/gocsi.(*StoragePlugin).GracefulStop.func1()
/go/pkg/mod/github.com/rexray/gocsi@v1.0.0/gocsi.go:333 +0x35
sync.(*Once).Do(0xc0002cc8fc, 0xc000380ef8)
/usr/local/go/src/sync/once.go:44 +0xb3
github.com/rexray/gocsi.(*StoragePlugin).GracefulStop(0xc0002cc870, 0x21183a0, 0xc000056018)
/go/pkg/mod/github.com/rexray/gocsi@v1.0.0/gocsi.go:332 +0x56
github.com/rexray/gocsi.Run.func3()
/go/pkg/mod/github.com/rexray/gocsi@v1.0.0/gocsi.go:121 +0x4e
github.com/rexray/gocsi.trapSignals.func1(0xc00052a240, 0xc000426990, 0xc000426900)
/go/pkg/mod/github.com/rexray/gocsi@v1.0.0/gocsi.go:502 +0x143
created by github.com/rexray/gocsi.trapSignals
/go/pkg/mod/github.com/rexray/gocsi@v1.0.0/gocsi.go:487 +0x107
OK, it turns out this SIGSEGV was a bug of some kind triggered by a network timeout, which makes the error itself something of a red herring.
Details: my vsphere-csi-controller-0 pod was (and actually still is) unable to reach the vSphere server, which caused the container in the pod to time out and trigger this SIGSEGV fault. The CSI contributors updated some libraries and the fault is now gone, but the timeout remains. The timeout appears to be my problem and is not related to CSI, but that's a new question :)
If you want the details of what was fixed in the CSI driver, check the GitHub link in the question.
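Since the remaining problem is the pod not reaching vCenter, a quick connectivity probe from inside the cluster network helps separate a routing/firewall issue from a CSI one. A rough sketch, assuming the vCenter address from the log above and its standard HTTPS port (the throwaway pod name and image are arbitrary):
# Probe vCenter from a disposable pod on the cluster network
kubectl run -it --rm netcheck --image=busybox --restart=Never -- nc -zv -w 5 132.250.31.180 443
# Run the same check from the node itself to see where the path breaks
nc -zv -w 5 132.250.31.180 443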

How to track down a SQL dump error?

How can I track down this dump error?
And most importantly, which process is causing it?
What are the consequences?
It happens almost every weekend.
See the SQL dump output below:
*Current time is 23:26:40 11/05/17.
=====================================================================
BugCheck Dump
=====================================================================
This file is generated by Microsoft SQL Server
version 13.0.4446.0
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.
Computer type is Intel(R) Xeon(R) CPU E5-2698B v3 @ 2.00GHz.
Bios Version is VRTUAL - 5001223
Intel(R) Xeon(R) CPU E5-2698B v3 @ 2.00GHz
4 X64 level 8664, 10 Mhz processor (s).
Windows NT 6.2 Build 9200 CSD .
Memory
MemoryLoad = 96%
Total Physical = 32767 MB
Available Physical = 994 MB
Total Page File = 39679 MB
Available Page File = 5602 MB
Total Virtual = 134217727 MB
Available Virtual = 134132460 MB
**Dump thread - spid = 0, EC = 0x000001DE6E277240
***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\LOG\SQLDump0006.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 11/05/17 23:26:40 spid 38
*
* Latch timeout
*
*
* *******************************************************************************
* -------------------------------------------------------------------------------
* Short Stack Dump*
You can use SQL Server Diagnostics (Preview), available as an extension from SSMS 17.1 onwards, to check for potential causes and any available resolutions.
After installing it you will see a screen like the one below; after uploading the dump you can find potential solutions or patches that may help you. Make sure you upload the dump to a location near you.
You can also load the dump in WinDbg and explore it if you have the right symbols; the event logs can show you more information.
Most of the time, stack dumps are produced because of bugs. The best way to proceed with them is to raise a ticket with Microsoft.
Elapsed time: 0 hours 0 minutes 6 seconds. Internal database snapshot has split point LSN = 00014377:000000a5:0001 and first LSN = 00014377:000000a3:0001.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (master.
**Dump thread - spid = 0, EC = 0x0000022824F95600
You need to check the dump file, where the stack dump details are provided.
We are sharing a small piece of it for analysis.
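Given the "repair_allow_data_loss is the minimum repair level" line in the dump, it is also worth running an explicit integrity check and attaching the output to the Microsoft ticket. A hedged sketch using sqlcmd against a local default instance with Windows authentication (instance name and auth method are assumptions), running a full consistency check of master and saving only the error messages to a file:
sqlcmd -S localhost -E -Q "DBCC CHECKDB (master) WITH NO_INFOMSGS, ALL_ERRORMSGS" -o checkdb_master.txt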

Flink Streaming Job Fails Automatically

I am running a Flink streaming job with parallelism 1.
Suddenly, after 8 hours, the job failed. It showed:
Association with remote system [akka.tcp://flink#192.168.3.153:44863] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
2017-04-12 00:48:36,683 INFO org.apache.flink.yarn.YarnJobManager - Container container_e35_1491556562442_5086_01_000002 is completed with diagnostics: Container [pid=64750,containerID=container_e35_1491556562442_5086_01_000002] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 2.9 GB of 4.2 GB virtual memory used. Killing container.
Dump of the process-tree for container_e35_1491556562442_5086_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 64750 64748 64750 64750 (bash) 0 0 108654592 306 /bin/bash -c /usr/java/jdk1.7.0_67-cloudera/bin/java -Xms724m -Xmx724m -XX:MaxDirectMemorySize=1448m -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/ -Dlog.file=/var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> /var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.out 2> /var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.err
|- 64756 64750 64750 64750 (java) 269053 57593 2961149952 524252 /usr/java/jdk1.7.0_67-cloudera/bin/java -Xms724m -Xmx724m -XX:MaxDirectMemorySize=1448m -Djava.library.path=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native/ -Dlog.file=/var/log/hadoop-yarn/container/application_1491556562442_5086/container_e35_1491556562442_5086_01_000002/taskmanager.log -Dlogback.configurationFile=file:logback.xml -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnTaskManagerRunner --configDir .
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
There are no application/code-side errors.
I need help understanding what could be the cause.
The job is killed because it exceeds the memory limits set in YARN.
See this part of your error message:
Container [pid=64750,containerID=container_e35_1491556562442_5086_01_000002] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 2.9 GB of 4.2 GB virtual memory used. Killing container.
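If you want to keep the job on YARN, the usual fix is to give each TaskManager container more memory, or to leave more of the container for off-heap and native allocations, so the 2 GB physical limit is no longer hit. A rough sketch for a Flink 1.x YARN deployment; the sizes and the jar path are placeholders, not recommendations:
# Resubmit with a 4 GB TaskManager container (and a 1 GB JobManager) instead of 2 GB
./bin/flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 4096 ./your-streaming-job.jar
# Or reserve a larger share of the container for non-heap memory in conf/flink-conf.yaml
# (Flink 1.x key, default 0.25):
#   containerized.heap-cutoff-ratio: 0.35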

Cassandra 2.0.7 to 2.1.2 sstable upgradesstables, compaction problems

We upgraded Cassandra (5+5 nodes) from 2.0.9 to 2.1.2 (binaries) and ran nodetool upgradesstables node by node (via a bash script). Since then we have observed some problems:
On every node we see about 50 "Pending Tasks", and on one of them more than 500; this has persisted for 5 days, since we started nodetool upgradesstables. Even though concurrent_compactors is set to 8, Cassandra never runs more than 3-4 compactions at the same time. The node with more than 500 pending tasks has about 11k files in its column family directory...
We have 2 SSD disks, but during compaction we see at most 10 MB/s of reads and 5 MB/s of writes, even with compaction_throughput_mb_per_sec set to 32, 64 or 256.
During upgradesstables on some tables we got:
WARN [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 ColumnFamilyStore.java:2492 - Unable to cancel in-progress compactions for reco_active_items_v1. Perhaps there is an unusually large row in progress somewhere, or the system is simply overloaded.
INFO [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 CompactionManager.java:247 - Aborting operation on reco_prod.reco_active_items_v1 after failing to interrupt other compaction operations
nodetool is failing with:
Aborted upgrading sstables for atleast one column family in keyspace reco_prod, check server logs for more information.
On some nodes nodetool upgradesstables finished successfully, but we can still see jb files in the column family directory.
On some nodes nodetool upgradesstables returns:
error: null
-- StackTrace --
java.lang.NullPointerException
at org.apache.cassandra.io.sstable.SSTableReader.cloneWithNewStart(SSTableReader.java:952)
at org.apache.cassandra.io.sstable.SSTableRewriter.moveStarts(SSTableRewriter.java:250)
at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:300)
at org.apache.cassandra.io.sstable.SSTableRewriter.abort(SSTableRewriter.java:186)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:204)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:75)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$4.execute(CompactionManager.java:340)
at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is our production environment (24h) and we observe higher load on the nodes and higher read latency, even more than 1 second.
Any advice...?
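Not a full answer, but for anyone debugging a similar backlog: the throttles and the queue can be inspected and adjusted at runtime with nodetool, and leftover 2.0-format sstables are easy to spot by name. A short sketch (paths and values are examples, assuming the default data directory):
# What is compacting right now and how much is queued
nodetool compactionstats
nodetool tpstats
# Temporarily lift the compaction throughput cap (0 = unthrottled)
nodetool setcompactionthroughput 0
# Any pre-upgrade (jb = 2.0.x format) sstables still on disk?
find /var/lib/cassandra/data/reco_prod -name '*-jb-*' | head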

How to find which process is leaking file handles in Linux?

The problem incident:
Our production system started denying services with the error message "Too many open files in system". Most services were affected, including the inability to start a new ssh session or even log in to a virtual console from the physical terminal. Luckily, one root ssh session was open, so we could interact with the system (moral: always keep one root session open!). As a side effect, some services (named, dbus-daemon, rsyslogd, avahi-daemon) saturated the CPU (100% load). The system also serves a large directory via NFS to a very busy client, which was backing up 50000 small files at the time. Restarting all kinds of services and programs normalized their CPU behavior, but did not solve the "Too many open files in system" problem.
The suspected cause
Most likely, some program is leaking file handles. Probably the culprit is my Tcl program, which also saturated the CPU (not normal for it). However, killing it did not help, and, most disturbingly, lsof did not reveal large numbers of open files.
Some evidence
We had to reboot, so whatever information was collected is all we have.
root@xeon:~# cat /proc/sys/fs/file-max
205900
root@xeon:~# lsof
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init 1 root cwd DIR 8,6 4096 2 /
init 1 root rtd DIR 8,6 4096 2 /
init 1 root txt REG 8,6 124704 7979050 /sbin/init
init 1 root mem REG 8,6 42580 5357606 /lib/i386-linux-gnu/libnss_files-2.13.so
init 1 root mem REG 8,6 243400 5357572 /lib/i386-linux-gnu/libdbus-1.so.3.5.4
...
A pretty normal list, definitely not 200K files, more like two hundred.
This is probably where the problem started:
less /var/log/syslog
Mar 27 06:54:01 xeon CRON[16084]: (CRON) error (grandchild #16090 failed with exit status 1)
Mar 27 06:54:21 xeon kernel: [8848865.426732] VFS: file-max limit 205900 reached
Mar 27 06:54:29 xeon postfix/master[1435]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system
Mar 27 06:54:29 xeon kernel: [8848873.611491] VFS: file-max limit 205900 reached
Mar 27 06:54:32 xeon kernel: [8848876.293525] VFS: file-max limit 205900 reached
netstat did not show noticeable anomalies either.
The man pages for ps and top do not indicate an ability to show open file count. Probably the problem will repeat itself after a few months (that was our uptime).
Any ideas on what else can be done to identify the open files?
UPDATE
This question has changed its meaning after qehgt identified the likely cause.
Apart from the bug in the NFS v4 code, I suspect there is a design limitation in Linux: kernel-leaked file handles cannot be identified. Consequently, the original question turns into:
"Who is responsible for file handles in the Linux kernel?" and "Where do I post that question?". The first answer was helpful, but I am willing to accept a better answer.
Probably the root cause is a bug in the NFSv4 implementation: https://stackoverflow.com/a/5205459/280758
They have similar symptoms.
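For future reference, the kernel-wide handle counter and a per-process descriptor count can be read straight from /proc; comparing the two shows whether the handles belong to any userspace process at all (here they did not, which is consistent with the kernel-side NFS bug). A small sketch:
# allocated handles, unused-but-allocated, and the system-wide maximum
cat /proc/sys/fs/file-nr
# open descriptors per process, worst offenders first
for p in /proc/[0-9]*; do
    printf '%s %s\n' "$(ls "$p/fd" 2>/dev/null | wc -l)" "${p#/proc/}"
done | sort -rn | head -20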

Resources