How can I track down this dump error?
Most importantly, which process is causing it?
What are the consequences?
It happens almost every weekend.
See the SQL dump output below:
*Current time is 23:26:40 11/05/17.
=====================================================================
BugCheck Dump
=====================================================================
This file is generated by Microsoft SQL Server
version 13.0.4446.0
upon detection of fatal unexpected error. Please return this file,
the query or program that produced the bugcheck, the database and
the error log, and any other pertinent information with a Service Request.
Computer type is Intel(R) Xeon(R) CPU E5-2698B v3 @ 2.00GHz.
Bios Version is VRTUAL - 5001223
Intel(R) Xeon(R) CPU E5-2698B v3 @ 2.00GHz
4 X64 level 8664, 10 Mhz processor (s).
Windows NT 6.2 Build 9200 CSD .
Memory
MemoryLoad = 96%
Total Physical = 32767 MB
Available Physical = 994 MB
Total Page File = 39679 MB
Available Page File = 5602 MB
Total Virtual = 134217727 MB
Available Virtual = 134132460 MB
**Dump thread - spid = 0, EC = 0x000001DE6E277240
***Stack Dump being sent to C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\LOG\SQLDump0006.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 11/05/17 23:26:40 spid 38
*
* Latch timeout
*
*
* *******************************************************************************
* -------------------------------------------------------------------------------
* Short Stack Dump*
You can use SQL Server Diagnostics (Preview), available as an SSMS extension from SSMS 17.1 onwards, to check for potential causes and any available resolutions.
After installing it, upload the dump and it will list potential solutions or patches that may help you. Make sure you upload the dump to a location near you.
You can also load the dump in WinDbg and examine it if you have the right symbols. The event logs can give you more information as well.
Most of the time, stack dumps are caused by bugs. The best way to proceed is to raise a ticket with Microsoft.
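Before opening a ticket, it can help to pull the session id and the stated reason out of each SQLDump text file, so that recurring weekend dumps can be compared. The following is only a minimal sketch: it assumes the header layout shown in the excerpt above (a "spid" line followed by the reason), and `summarize_dump` is a hypothetical helper name, not part of any SQL Server tooling.

```python
import re

# Assumed layout, taken from the excerpt in the question:
#   * BEGIN STACK DUMP:
#   *   11/05/17 23:26:40 spid 38
#   *
#   * Latch timeout
SPID_LINE = re.compile(r"(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) spid (\d+)")


def summarize_dump(text):
    """Return the time, spid and reason found in a dump header, or {} if absent."""
    # Drop the leading '*' decoration and surrounding whitespace from every line.
    lines = [ln.strip("* \t") for ln in text.splitlines()]
    for i, ln in enumerate(lines):
        m = SPID_LINE.match(ln)
        if not m:
            continue
        summary = {"time": m.group(1), "spid": int(m.group(2))}
        # The first non-empty line after the spid line is treated as the reason.
        for nxt in lines[i + 1:]:
            if nxt:
                summary["reason"] = nxt
                break
        return summary
    return {}
```

Running it over every SQLDump*.txt in the LOG directory would show whether the same spid and the same reason (here "Latch timeout") come back each weekend, which is useful context for a support request.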
Elapsed time: 0 hours 0 minutes 6 seconds. Internal database snapshot has split point LSN = 00014377:000000a5:0001 and first LSN = 00014377:000000a3:0001.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (master.
**Dump thread - spid = 0, EC = 0x0000022824F95600
You need to check the dump file, where the stack dump details are provided.
We are sharing a small piece of code for analysis.
i am using nanodump for dumping lsass.exe.
everything is ok, but when i get to mimikatz by following command,got error:
mimikatz.exe "sekurlsa::minidump <path/to/dumpfile>" "sekurlsa::logonPasswords full" exit
mimikatz error:
ERROR kuhl_m_sekurlsa_acquireLSA ; Memory opening
i use "x64 nanodump ssp dll", and AddSecurityPackage winapi for attaching to lsass
when i was testing all way's, detect that nanodump specified dump file size(default=>report.docx),is different from procmon.exe Full and Mini dump output.
my test:
procmon full = 71 MB
procmon mini = 1.6 MB
nanodump = 11 MB
what can i do for dump by nanodump,compatible with mimikatz::logonpasswords?
this was for invalid file signature dumped by nano ssp module, this probled solved by this command:
[nano git source]/scripts/restore_signature.exe
I downloaded the asimbench files provided on the gem5.org website and modified configs/common/FSConfig.py with the following changes:
def makeArmSystem(..):
    ..................
    self.cf0 = CowIdeDisk(driveID='master')
    self.cf2 = CowIdeDisk(driveID='master')
    self.cf0.childImage(mdesc.disk())
    self.cf2.childImage(disk("sdcard-1g-mxplayer.img"))
    # Old platforms have a built-in IDE or CF controller. Default to
    # the IDE controller if both exist. New platforms expect the
    # storage controller to be added from the config script.
    if hasattr(self.realview, "ide"):
        #self.realview.ide.disks = [self.cf0]
        self.realview.ide.disks = [self.cf0, self.cf2]
    elif hasattr(self.realview, "cf_ctrl"):
        #self.realview.cf_ctrl.disks = [self.cf0]
        self.realview.cf_ctrl.disks = [self.cf0, self.cf2]
    else:
        self.pci_ide = IdeController(disks=[self.cf0])
        pci_devices.append(self.pci_ide)
I used this command:
./build/ARM/gem5.opt configs/example/fs.py --mem-size=8192MB \
    --disk-image=/home/yaz/gem5/full_system_images/disks/ARMv7a-ICS-Android.SMP.Asimbench-v3.img \
    --kernel=/home/yaz/gem5/full_system_images/binaries/vmlinux.smp.ics.arm.asimbench.2.6.35 \
    --os-type=android-ics --cpu-type=MinorCPU --machine-type=VExpress_GEM5 --script=/home/yaz/gem5/full_system_images/boot/adobe.rcS
warn: CheckedInt already exists in allParams. This may be caused by the Python 2.7 compatibility layer.
warn: Enum already exists in allParams. This may be caused by the Python 2.7 compatibility layer.
warn: ScopedEnum already exists in allParams. This may be caused by the Python 2.7 compatibility layer.
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.
gem5 version 20.0.0.3
gem5 compiled Jul 7 2020 16:17:12
gem5 started Jul 16 2020 04:41:50
gem5 executing on yazeed-OptiPlex-9010, pid 3367
command line: ./build/ARM/gem5.opt configs/example/fs.py --mem-size=8192MB --disk-image=/home/yaz/gem5/full_system_images/disks/ARMv7a-ICS-Android.SMP.Asimbench-v3.img --kernel=/home/yaz/gem5/full_system_images/binaries/vmlinux.smp.ics.arm.asimbench.2.6.35 --os-type=android-ics --cpu-type=MinorCPU --machine-type=VExpress_GEM5 --script=/home/yaz/gem5/full_system_images/boot/adobe.rcS
Global frequency set at 1000000000000 ticks per second
warn: No dot file generated. Please install pydot to generate the dot file and pdf.
info: kernel located at: /home/yaz/gem5/full_system_images/binaries/vmlinux.smp.ics.arm.asimbench.2.6.35
system.vncserver: Listening for connections on port 5900
system.terminal: Listening for connections on port 3456
system.realview.uart1.device: Listening for connections on port 3457
system.realview.uart2.device: Listening for connections on port 3458
system.realview.uart3.device: Listening for connections on port 3459
0: system.remote_gdb: listening for remote gdb on port 7000
info: Using bootloader at address 0x80000000
info: Using kernel entry physical address at 0x140008000
warn: DTB file specified, but no device tree support in kernel
**** REAL SIMULATION ****
warn: Existing EnergyCtrl, but no enabled DVFSHandler found.
info: Entering event queue @ 0. Starting simulation...
fatal: Unable to find destination for [0x40008000:0x40008040] on system.iobus
Memory Usage: 8786764 KBytes
Thanks for helping
We upgraded Cassandra (5+5 nodes) from 2.0.9 to 2.1.2 (binaries) and ran nodetool upgradesstables on the nodes one by one (via a bash script). Since then we have observed several problems:
On every node we see about 50 "Pending Tasks", and on one of them more than 500. This has persisted for the 5 days since we started nodetool upgradesstables. Even though concurrent_compactors is set to 8, Cassandra never runs more than 3-4 compactions at the same time. The node with more than 500 pending tasks has about 11k files in the column family directory.
We have 2 SSD disks, but during compaction we see at most 10 MB/s of reads and 5 MB/s of writes, even with compaction_throughput_mb_per_sec set to 32, 64 or 256.
During upgradesstables, some tables produced:
WARN [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 ColumnFamilyStore.java:2492 - Unable to cancel in-progress compactions for reco_active_items_v1. Perhaps there is an unusually large row in progress somewhere, or the system is simply overloaded.
INFO [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 CompactionManager.java:247 - Aborting operation on reco_prod.reco_active_items_v1 after failing to interrupt other compaction operations
nodetool then fails with:
Aborted upgrading sstables for atleast one column family in keyspace reco_prod, check server logs for more information.
On some nodes nodetool upgradesstables finished successfully, but we can still see jb files in the column family directory.
On other nodes nodetool upgradesstables returns:
error: null
-- StackTrace --
java.lang.NullPointerException
at org.apache.cassandra.io.sstable.SSTableReader.cloneWithNewStart(SSTableReader.java:952)
at org.apache.cassandra.io.sstable.SSTableRewriter.moveStarts(SSTableRewriter.java:250)
at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:300)
at org.apache.cassandra.io.sstable.SSTableRewriter.abort(SSTableRewriter.java:186)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:204)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:75)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$4.execute(CompactionManager.java:340)
at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is our production environment (running 24h), and we observe higher load on the nodes and higher read latency, sometimes more than 1 second.
Any advice?
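To see how far each node really got, one option is to count the data files per SSTable format version in a column family directory. This is a rough sketch under stated assumptions: it assumes the file naming of these releases, `<keyspace>-<table>-<version>-<generation>-Data.db`, with "jb" as the old format mentioned above and "ka" as the format written by 2.1 (the "ka" name is an assumption, not taken from the post). `sstable_versions` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed file name layout: <keyspace>-<table>-<version>-<generation>-Data.db
# e.g. reco_prod-reco_active_items_v1-jb-1234-Data.db
DATA_FILE = re.compile(r"-(?P<version>[a-z]{2})-(?P<generation>\d+)-Data\.db$")


def sstable_versions(filenames):
    """Count Data.db components per format version (e.g. 'jb' vs 'ka')."""
    counts = Counter()
    for name in filenames:
        m = DATA_FILE.search(name)
        if m:
            counts[m.group("version")] += 1
    return counts
```

Calling `sstable_versions(os.listdir(cf_dir))` on each node (after `import os`, with `cf_dir` pointing at the column family directory) gives a quick per-node figure for how many jb files are still waiting to be rewritten.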
The problem incident:
Our production system started denying services with the error message "Too many open files in system". Most services were affected: it was impossible to start a new ssh session, or even to log in to the virtual console from the physical terminal. Luckily, one root ssh session was open, so we could interact with the system (moral: always keep one root session open!). As a side effect, some services (named, dbus-daemon, rsyslogd, avahi-daemon) saturated the CPU (100% load). The system also serves a large directory via NFS to a very busy client, which was backing up 50000 small files at that moment. Restarting all kinds of services and programs normalized their CPU behavior, but did not solve the "Too many open files in system" problem.
The suspected cause
Most likely, some program is leaking file handles. The culprit is probably my tcl program, which also saturated the CPU (not normal). However, killing it did not help and, most disturbingly, lsof did not reveal a large number of open files.
Some evidence
We had to reboot, so whatever information was collected is all we have.
root@xeon:~# cat /proc/sys/fs/file-max
205900
root@xeon:~# lsof
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init 1 root cwd DIR 8,6 4096 2 /
init 1 root rtd DIR 8,6 4096 2 /
init 1 root txt REG 8,6 124704 7979050 /sbin/init
init 1 root mem REG 8,6 42580 5357606 /lib/i386-linux-gnu/libnss_files-2.13.so
init 1 root mem REG 8,6 243400 5357572 /lib/i386-linux-gnu/libdbus-1.so.3.5.4
...
A pretty normal list, definitely not 200K files, more like two hundred.
This is probably where the problem started:
less /var/log/syslog
Mar 27 06:54:01 xeon CRON[16084]: (CRON) error (grandchild #16090 failed with exit status 1)
Mar 27 06:54:21 xeon kernel: [8848865.426732] VFS: file-max limit 205900 reached
Mar 27 06:54:29 xeon postfix/master[1435]: warning: master_wakeup_timer_event: service pickup(public/pickup): Too many open files in system
Mar 27 06:54:29 xeon kernel: [8848873.611491] VFS: file-max limit 205900 reached
Mar 27 06:54:32 xeon kernel: [8848876.293525] VFS: file-max limit 205900 reached
netstat did not show noticeable anomalies either.
The man pages for ps and top do not indicate an ability to show open file count. Probably the problem will repeat itself after a few months (that was our uptime).
Any ideas on what else can be done to identify the open files?
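One check that could be run the next time this happens is to compare the kernel's own handle count in /proc/sys/fs/file-nr (allocated, unused, maximum) with the number of descriptors visible under /proc/<pid>/fd. The sketch below is only an approximation: descriptors shared between processes are counted more than once, so a large positive gap suggests handles held inside the kernel rather than by any process. The function names are hypothetical.

```python
import os


def parse_file_nr(text):
    """Parse /proc/sys/fs/file-nr: allocated handles, unused handles, system maximum."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}


def fd_counts(proc_root="/proc"):
    """Count open descriptors per PID; run as root to see every process."""
    counts = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            counts[int(entry)] = len(os.listdir(os.path.join(proc_root, entry, "fd")))
        except OSError:
            continue  # process exited or access was denied
    return counts


def unaccounted(file_nr, counts):
    """Handles in use according to the kernel that no process fd table explains."""
    return file_nr["allocated"] - file_nr["unused"] - sum(counts.values())
```

Sorting `fd_counts()` by value also gives the per-process open file count that ps and top do not show.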
UPDATE
The meaning of this question has changed since qehgt identified the likely cause.
Apart from the bug in NFS v4 code, I suspect there is a design limitation in Linux and kernel-leaked file handles can NOT be identified. Consequently, the original question transforms into:
"Who is responsible for file handles in the Linux kernel?" and "Where do I post that question?". The 1st answer was helpful, but I am willing to accept a better answer.
Probably the root cause is a bug in NFSv4 implementation: https://stackoverflow.com/a/5205459/280758
They have similar symptoms.
I am using Red Hat 5.5 and I am trying to run Sybase ASE 12.5.4.
Yesterday I ran the command "service sybase start" and the console showed Sybase repeatedly trying, and failing, to initialize the main database server.
UPDATE:
I initialized a database at /ims_systemdb/master using the following commands:
dataserver -d /ims_systemdb/master -z 2k -b 51204 -c $SYBASE/ims.cfg -e db_error.log
chmod a=rwx /ims_systemdb/master
ls -al /ims_systemdb/master
And it gives me a nice database at /ims_systemdb/master with a size of 104865792 bytes (2048 x 51204).
But when I run
service sybase start
The error log at /logs/sybase_error.log goes like this:
00:00000:00000:2013/04/26 16:11:45.18 kernel Using config area from primary master device.
00:00000:00000:2013/04/26 16:11:45.19 kernel Detected 1 physical CPU
00:00000:00000:2013/04/26 16:11:45.19 kernel os_create_region: can't allocate 11534336000 bytes
00:00000:00000:2013/04/26 16:11:45.19 kernel kbcreate: couldn't create kernel region.
00:00000:00000:2013/04/26 16:11:45.19 kernel kistartup: could not create shared memory
I read that the "os_create_region" error is normal if shmmax in sysctl is not set high enough, so I set it to 16000000000000, but I still get this error. Sometimes, when I'm playing around with the .cfg file, I get this error message instead:
00:00000:00000:2013/04/25 14:04:08.28 kernel Using config area from primary master device.
00:00000:00000:2013/04/25 14:04:08.29 kernel Detected 1 physical CPU
00:00000:00000:2013/04/25 14:04:08.85 server The size of each partitioned pool must have atleast 512K. With the '16' partitions we cannot configure this value f
Why do these two errors appear and what can I do about them?
UPDATE:
Currently, I'm seeing the 1st error message (os cannot allocate bytes). The contents of /etc/sysctl.conf are as follows:
kernel.shmmax = 4294967295
kernel.shmall = 1048576
kernel.shmmni = 4096
But the log statements earlier state that
os_create_region: can't allocate 11534336000 bytes
So why is the region it is trying to allocate so big, and where did that get set?
The Solution:
When you get a message like "os_create_region: can't allocate 11534336000 bytes", it means that Sybase's configuration file is asking the kernel to create a shared memory region larger than the shmmax value set in /etc/sysctl.conf.
The main thing to do is to edit ims.cfg (or whatever configuration file you are using) and change the max memory variable in the [Physical Memory] section:
[Physical Memory]
max memory = 64000
additional network memory = 10485760
shared memory starting address = DEFAULT
allocate max shared memory = 1
For your information, my /etc/sysctl.conf file ended with these three lines:
kernel.shmmax = 16000000000
kernel.shmall = 16000000000
kernel.shmmni = 8192
And once this is done, type "showserver" to reveal what processes are running.
For more information, consult the Sybase System Administrator's Guide, volume 2 as well as Michael Gardner's link to Red Hat memory management in the comments earlier.
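The arithmetic behind the fix can be checked with a few lines. This is a sketch, assuming that "max memory" is expressed in 2K pages and that kernel.shmall is expressed in 4096-byte pages; both units are assumptions and should be confirmed against the Sybase and kernel documentation. The helper names are hypothetical.

```python
SYBASE_PAGE = 2048   # assumed unit of "max memory" (2K pages)
KERNEL_PAGE = 4096   # assumed unit of kernel.shmall


def max_memory_bytes(max_memory_pages):
    """Bytes the server will request for a given "max memory" setting."""
    return max_memory_pages * SYBASE_PAGE


def pages_for(requested_bytes):
    """The "max memory" value that corresponds to a requested byte count."""
    return requested_bytes // SYBASE_PAGE


def shmall_bytes(shmall_pages):
    """Total shared memory allowed by kernel.shmall, in bytes."""
    return shmall_pages * KERNEL_PAGE


def fits_in_shmmax(requested_bytes, shmmax_bytes):
    """True if a single segment of this size is allowed by kernel.shmmax."""
    return requested_bytes <= shmmax_bytes


# The failing request from the log, against the old and new shmmax values.
print(fits_in_shmmax(11534336000, 4294967295))    # old /etc/sysctl.conf
print(fits_in_shmmax(11534336000, 16000000000))   # new /etc/sysctl.conf
print(pages_for(11534336000), max_memory_bytes(64000))
```

Under these assumptions the failing request corresponds to a "max memory" of 5632000 pages, while the new setting of 64000 asks for only 131072000 bytes, which fits under either shmmax value.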