Running Zookeeper 3.4.12 and Solr 6.5.1 with Systemd - Solr immediately shuts down after starting

I'm fairly new to systemd and ZooKeeper, so please be patient, thanks.
Any help is appreciated!
The Setup:
Suse 12 Enterprise
No init.d, only systemd
zookeeper 3.4.12 is running via systemd and listening on default port 2181
solr 6.5.1 must also run via systemd, but immediately after starting, Solr decides to shut down for an unknown reason
solr.log shows a connection being established to ZooKeeper and, shortly after, a warning and the shutdown
the ZooKeeper logfile doesn't have a single line about Solr dropping out
I'm confused and unable to determine whether this is an issue with the systemd unit files or a ZooKeeper <-> Solr issue.
Questions:
Is the unit file solr.service correct? (I'm not sure about that; examples on the net are very scarce)
Is this a systemd issue or a ZooKeeper problem?
Which logs can I turn on to get more insight?
As @MatsLindh points out, this is a systemd issue. The Solr log WARN was just a coincidence.
journalctl -u solr
Sep 05 16:42:36 mucs75561 systemd[1]: Started Apache Solr Service.
Sep 05 16:42:40 mucs75561 solr[15732]: [98B blob data]
Sep 05 16:42:40 mucs75561 solr[15732]: Started Solr server on port 8983 (pid=15857). Happy searching!
Sep 05 16:42:40 mucs75561 solr[15942]: Sending stop command to Solr running on port 8983 ... waiting up to 180 seconds to allow Jetty process 15857 to stop gracefully.
The solr.log tells a different story (tail -n 1000 -f /opt/xxx/solr-6.5.1/server/logs/solr.log)
cat /opt/xxx/solr-6.5.1/server/logs/solr.log
16:42:38.594 INFO (main) [ ] o.e.j.s.Server jetty-9.3.14.v20161028
16:42:38.992 INFO (main) [ ] o.a.s.s.SolrDispatchFilter ___ _ Welcome to Apache Solr™ version 6.5.1
16:42:38.996 INFO (main) [ ] o.a.s.s.SolrDispatchFilter / __| ___| |_ _ Starting in cloud mode on port 8983
16:42:38.996 INFO (main) [ ] o.a.s.s.SolrDispatchFilter \__ \/ _ \ | '_| Install dir: /opt/xxx/solr-6.5.1
16:42:39.016 INFO (main) [ ] o.a.s.s.SolrDispatchFilter |___/\___/_|_| Start time: 2018-09-05T16:42:38.998Z
16:42:39.017 INFO (main) [ ] o.a.s.s.StartupLoggingUtils Property solr.log.muteconsole given. Muting ConsoleAppender named CONSOLE
16:42:39.035 INFO (main) [ ] o.a.s.c.SolrResourceLoader Using system property solr.solr.home: /opt/xxx/solr-6.5.1/server/solr
16:42:39.099 INFO (main) [ ] o.a.s.s.SolrDispatchFilter Loading solr.xml from SolrHome (not found in ZooKeeper)
16:42:39.100 INFO (main) [ ] o.a.s.c.SolrXmlConfig Loading container configuration from /opt/xxx/solr-6.5.1/server/solr/solr.xml
16:42:39.413 INFO (main) [ ] o.a.s.u.UpdateShardHandler Creating UpdateShardHandler HTTP client with params: socketTimeout=600000&connTimeout=60000&retry=true
16:42:39.418 INFO (main) [ ] o.a.s.c.ZkContainer Zookeeper client=localhost:2181/solr
16:42:39.510 INFO (main) [ ] o.a.s.c.Overseer Overseer (id=null) closing
16:42:39.514 INFO (main) [ ] o.a.s.c.OverseerElectionContext I am going to be the leader 192.168.18.49:8983_solr
16:42:39.519 INFO (main) [ ] o.a.s.c.Overseer Overseer (id=72167078483197975-192.168.18.49:8983_solr-n_0000000009) starting
16:42:39.616 INFO (main) [ ] o.a.s.c.ZkController Register node as live in ZooKeeper:/live_nodes/192.168.18.49:8983_solr
16:42:39.622 INFO (OverseerStateUpdate-72167078483197975-192.168.18.49:8983_solr-n_0000000009) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (0) -> (1)
16:42:39.812 INFO (main) [ ] o.a.s.c.CorePropertiesLocator Found 0 core definitions underneath /opt/xxx/solr-6.5.1/server/solr
16:42:39.898 INFO (main) [ ] o.e.j.s.Server Started @1864ms
16:42:40.549 INFO (ShutdownMonitor) [ ] o.a.s.c.CoreContainer Shutting down CoreContainer instance=966739377
16:42:40.557 INFO (ShutdownMonitor) [ ] o.a.s.c.Overseer Overseer (id=72167078483197975-192.168.18.49:8983_solr-n_0000000009) closing
16:42:40.558 INFO (OverseerStateUpdate-72167078483197975-192.168.18.49:8983_solr-n_0000000009) [ ] o.a.s.c.Overseer Overseer Loop exiting : 192.168.18.49:8983_solr
16:42:40.566 WARN (zkCallback-5-thread-1-processing-n:192.168.18.49:8983_solr) [ ] o.a.s.c.c.ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK: [KeeperErrorCode = Session expired for /live_nodes]
16:42:40.566 INFO (ShutdownMonitor) [ ] o.a.s.m.SolrMetricManager Closing metric reporters for: solr.node
My /etc/systemd/system/solr.service:
[Unit]
Description=Apache Solr Service
After=syslog.target network.target nss-lookup.target
Requires=zookeeper.service
[Service]
User=xxx
Group=tomcat
WorkingDirectory=/opt/xxx/solr-6.5.1/
Environment=SOLR_INCLUDE=/opt/xxx/solr-6.5.1/bin/solr.in.sh
ExecStart=/opt/xxx/solr-6.5.1/bin/solr start -m 4g -c -z localhost:2181/solr
ExecStop=/opt/xxx/solr-6.5.1/bin/solr stop -all
[Install]
WantedBy=default.target
Thanks for reading!

systemd requires the service it starts to remain running. Since the Solr startup script exits after starting Solr (i.e. it daemonizes the process and leaves it running in the background), systemd thinks it's dead and attempts to stop it.
You can start Solr in the foreground with bin/solr start -f:
-f Start Solr in foreground; default starts Solr in the background
and sends stdout / stderr to solr-PORT-console.log
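As a minimal sketch, the unit file above could be adjusted like this; the only functional change is adding -f to ExecStart so the process stays in the foreground and systemd can supervise it directly (the ExecStop line becomes unnecessary, since systemd can signal the foreground process itself):
[Unit]
Description=Apache Solr Service
After=syslog.target network.target nss-lookup.target
Requires=zookeeper.service
[Service]
User=xxx
Group=tomcat
WorkingDirectory=/opt/xxx/solr-6.5.1/
Environment=SOLR_INCLUDE=/opt/xxx/solr-6.5.1/bin/solr.in.sh
ExecStart=/opt/xxx/solr-6.5.1/bin/solr start -f -m 4g -c -z localhost:2181/solr
[Install]
WantedBy=default.target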

Related

gem5 full system Linux boot fails with "Kernel panic - not syncing: VFS: Unable to mount root fs"

I want to run an ARM Linux system in gem5's full-system (fs) mode. I downloaded the related files from this address:
http://www.gem5.org/documentation/general_docs/fullsystem/guest_binaries
I was able to configure the correct file paths, but finally got this output in terminal 2:
[ 0.661620] No filesystem could mount root, tried:
[ 0.661621] ext3
[ 0.661650] ext4
[ 0.661663] ext2
[ 0.661676] vfat
[ 0.661690] fuseblk
[ 0.661703]
[ 0.661728] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)
And got this terrible output in terminal 1:
warn: Tried to read RealView I/O at offset 0x60 that doesn't exist
warn: Tried to write RVIO at offset 0xa8 (data 0) that doesn't exist
warn: Kernel panic in simulated kernel
Here is my command-line input; simply adjusting the configuration in it only leads to the same result:
./build/ARM/gem5.opt configs/example/arm/starter_fs.py --cpu="minor" --num-cores=4 --disk-image=/home/ad/GEM5/ARM_GEM5/gem5/my_image/disks/aarch64-ubuntu-trusty-headless.img --dtb=/home/ad/GEM5/ARM_GEM5/gem5/my_image/binaries/armv7_gem5_v1_4cpu.dtb --kernel=/home/ad/GEM5/ARM_GEM5/gem5/my_image/binaries/vmlinux
How can I solve this?
Here is a full diagnostic procedure for this kind of problem: https://askubuntu.com/questions/41930/kernel-panic-not-syncing-vfs-unable-to-mount-root-fs-on-unknown-block0-0/1048477#1048477
In summary, you have to ensure that:
the kernel has the config to read the disk type, for emulation usually:
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_BLK=y
This seems to be the problem, since no list of partitions was printed above (please confirm). Without those options, the kernel can't read bytes from the disk at all.
the kernel has the config to read the filesystem type. Your kernel mentions ext2/3/4 though, so that's likely not the problem.
you are pointing the root= kernel CLI to the right partition
See also: https://cirosantilli.com/linux-kernel-module-cheat/#not-syncing That also contains a Buildroot setup that just works.
I also highly recommend that you first get it working on QEMU which boots much faster.
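For reference, a minimal QEMU invocation for this kind of experiment might look like the following sketch; the file names (vmlinux, rootfs.ext4) are placeholders for your own kernel and disk image, and the virtio options match the CONFIG_VIRTIO_* kernel config mentioned above:
qemu-system-aarch64 -M virt -cpu cortex-a57 -m 2G \
    -kernel vmlinux \
    -drive file=rootfs.ext4,format=raw,if=virtio \
    -append "root=/dev/vda console=ttyAMA0" \
    -nographic
Note that root=/dev/vda (the whole device) is correct here only because rootfs.ext4 is a bare filesystem image with no partition table; for a partitioned image you would use root=/dev/vda1, as discussed below.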
I had a similar problem. Trying to run a full-system emulation of both Ubuntu and Linaro minimal (from the gem5 website) under a 64-bit kernel, with the original starter_fs.py script, gives me this kernel panic:
[ 0.224367] List of all partitions:
[ 0.224394] fe00 1048320 vda
[ 0.224397] driver: virtio_blk
[ 0.224440] fe01 1048288 vda1 00000000-01
[ 0.224441]
[ 0.224480] No filesystem could mount root, tried:
[ 0.224481] ext3
[ 0.224510] ext4
[ 0.224524] ext2
[ 0.224537] squashfs
[ 0.224551] vfat
[ 0.224566] fuseblk
[ 0.224579]
[ 0.224606] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0)
[ 0.224656] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.18.0+ #1
[ 0.224692] Hardware name: V2P-CA15 (DT)
[ 0.224717] Call trace:
[ 0.224741] dump_backtrace+0x0/0x1c0
[ 0.224765] show_stack+0x14/0x20
[ 0.224790] dump_stack+0x8c/0xac
[ 0.224812] panic+0x130/0x288
[ 0.224836] mount_block_root+0x22c/0x294
[ 0.224861] mount_root+0x140/0x174
[ 0.224884] prepare_namespace+0x138/0x180
[ 0.224910] kernel_init_freeable+0x1c0/0x1e0
[ 0.224939] kernel_init+0x10/0x108
[ 0.224961] ret_from_fork+0x10/0x18
[ 0.224987] Kernel Offset: disabled
[ 0.225009] CPU features: 0x21c06492
[ 0.225032] Memory Limit: 2048 MB
[ 0.225056] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(254,0) ]---
The weird thing is that, a few weeks ago, it worked like a charm. The problem lies in the specification of the root partition on the kernel command line. In starter_fs.py, change this line:
"root=/dev/vda",
To this:
"root=/dev/vda1",
You can see that before, the VirtIO block device itself was specified; the kernel wants a partition, not a block device. Then you can run gem5:
build/ARM/gem5.opt configs/example/arm/starter_fs.py --cpu="hpi" --num-cores=1 --disk-image="linaro-minimal-aarch64.img" --kernel="vmlinux.arm64"
And for me, the kernel panic is gone and I am able to boot my system again:
[ 0.228847] EXT4-fs (vda1): mounted filesystem without journal. Opts: (null)
[ 0.228906] VFS: Mounted root (ext4 filesystem) on device 254:1.
[ 0.229539] devtmpfs: mounted
[ 0.229792] Freeing unused kernel memory: 448K
INIT: version 2.88 booting
[ 0.234168] random: fast init done
Starting udev
[ 0.277039] udevd[715]: starting version 182
[ 0.411534] EXT4-fs (vda1): re-mounted. Opts: block_validity,delalloc,barrier,user_xattr
Starting Bootlog daemon: bootlogd.
[ 0.426573] random: dd: uninitialized urandom read (512 bytes read)
Populating dev cache
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.all.rp_filter = 1
hwclock: can't open '/dev/misc/rtc': No such file or directory
Mon Jan 27 08:00:00 UTC 2014
hwclock: can't open '/dev/misc/rtc': No such file or directory
INIT: Entering runlevel: 5
Configuring network interfaces... ifconfig: SIOCGIFFLAGS: No such device
Starting rpcbind daemon...rpcbind: cannot create socket for udp6
rpcbind: cannot create socket for tcp6
done.
rpcbind: cannot get uid of '': Success
creating NFS state directory: done
starting statd: done
Starting auto-serial-console: done
Stopping Bootlog daemon:
bootlogd.
Last login: Mon Jan 27 08:00:00 UTC 2014 on tty1
INIT: no more processes left in this runlevel
root@genericarmv8:~# id
id
uid=0(root) gid=0(root) groups=0(root)
root@genericarmv8:~#

Unable to log into docker container syslog file

In a component test, I build a C binary and test it with pytest.
In the C binary I use syslog.h in order to log what happens.
However, there is no syslog file in /var/log/ in the container, and no information in the syslog on the host.
I have tried running rsyslog as a service in the container, with several rsyslog.conf configurations.
I have also edited the Docker daemon.json to use the syslog service.
import subprocess
import pytest

@pytest.fixture(scope="session")
def generator():
    process = subprocess.Popen(["./build/bin/myBinary", "-v"])
    yield process
    process.terminate()
/* myBinary */
#include <syslog.h>

int main(int argc, char **argv)
{
    int logoptions = LOG_PID; /* not shown in the original; LOG_PID is a common choice */
    openlog("myBinary", logoptions, LOG_DAEMON);
    syslog(LOG_NOTICE, "Service is starting...");
    closelog();
    return 0;
}
I expected syslog entries either on the host or in the container's /var/log, but there is nothing.
The problem was that the rsyslog service was not launched:
root@xxx:/var/log# service rsyslog restart
[ ok ] Stopping enhanced syslogd: rsyslogd already stopped.
[ ok ] Starting enhanced syslogd: rsyslogd.
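If rsyslog needs to be running before the tests start, one option is to launch it from the container's entrypoint. A sketch, assuming a Debian-based image (the entrypoint.sh name is just an example):
#!/bin/sh
# entrypoint.sh: start rsyslog so syslog(3) calls from the binary
# end up in the container's /var/log/syslog, then run the tests
service rsyslog start
exec pytest "$@"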

Google Compute Engine VM instance: VFS: Unable to mount root fs on unknown-block

My instance on Google Compute Engine is not booting up due to some boot order issues.
So I created another instance and re-configured my machine.
My questions:
How can I handle these issues when I host some websites?
How can I recover my data from old disk?
logs
[ 0.348577] Key type trusted registered
[ 0.349232] Key type encrypted registered
[ 0.349769] AppArmor: AppArmor sha1 policy hashing enabled
[ 0.350351] ima: No TPM chip found, activating TPM-bypass!
[ 0.351070] evm: HMAC attrs: 0x1
[ 0.351549] Magic number: 11:333:138
[ 0.352077] block ram3: hash matches
[ 0.352550] rtc_cmos 00:00: setting system clock to 2015-12-19 17:06:53 UTC (1450544813)
[ 0.353492] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[ 0.354108] EDD information not available.
[ 0.536267] input: AT Translated Set 2 keyboard as /devices/platform/i8042/serio0/input/input2
[ 0.537862] md: Waiting for all devices to be available before autodetect
[ 0.538979] md: If you don't use raid, use raid=noautodetect
[ 0.539969] md: Autodetecting RAID arrays.
[ 0.540699] md: Scanned 0 and added 0 devices.
[ 0.541565] md: autorun ...
[ 0.542093] md: ... autorun DONE.
[ 0.542723] VFS: Cannot open root device "sda1" or unknown-block(0,0): error -6
[ 0.543731] Please append a correct "root=" boot option; here are the available partitions:
[ 0.545011] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[ 0.546199] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.19.0-39-generic #44~14.04.1-Ubuntu
[ 0.547579] Hardware name: Google Google, BIOS Google 01/01/2011
[ 0.548728] ffffea00008ae140 ffff880024ee7db8 ffffffff817af92b 000000000000111e
[ 0.549004] ffffffff81a7c7c8 ffff880024ee7e38 ffffffff817a976b ffff880024ee7dd8
[ 0.549004] ffffffff00000010 ffff880024ee7e48 ffff880024ee7de8 ffff880024ee7e38
[ 0.549004] Call Trace:
[ 0.549004] [] dump_stack+0x45/0x57
[ 0.549004] [] panic+0xc1/0x1f5
[ 0.549004] [] mount_block_root+0x210/0x2a9
[ 0.549004] [] mount_root+0x54/0x58
[ 0.549004] [] prepare_namespace+0x16d/0x1a6
[ 0.549004] [] kernel_init_freeable+0x1f6/0x20b
[ 0.549004] [] ? initcall_blacklist+0xc0/0xc0
[ 0.549004] [] ? rest_init+0x80/0x80
[ 0.549004] [] kernel_init+0xe/0xf0
[ 0.549004] [] ret_from_fork+0x58/0x90
[ 0.549004] [] ? rest_init+0x80/0x80
[ 0.549004] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 0.549004] ---[ end Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
What Causes This?
That is the million dollar question. After inspecting my GCE VM, I found out there were 14 different kernels installed, taking up several hundred MB of space. Most of the kernels didn't have a corresponding initrd.img file and were therefore not bootable (including 3.19.0-39-generic).
I certainly never went around trying to install random kernels, and once removed, they no longer appear as available upgrades, so I'm not sure what happened. Seriously, what happened?
Edit: New response from Google Cloud Support.
I received another disconcerting response. This may explain the additional, errant kernels.
"On rare occasions, a VM needs to be migrated from one physical host to another. In such case, a kernel upgrade and security patches might be applied by Google."
1. "How can I handle these issues when I host some websites?"
My first instinct is to recommend using AWS instead of GCE. However, GCE is less expensive. Before doing any upgrades, make sure you take a snapshot, and try rebooting the server to see if the upgrades broke anything.
2. How can I recover my data from old disk?
Even Better - How to recover your instance...
After several back-and-forth emails, I finally received a response from support that allowed me to resolve the issue. Be mindful, you will have to change things to match your unique VM.
Take a snapshot of the disk first in case we need to roll back any of the changes below.
Edit the properties of the broken instance to disable this option: "Delete boot disk when instance is deleted"
Delete the broken instance.
IMPORTANT: ensure not to select the option to delete the boot disk. Otherwise, the disk will get removed permanently!!
Start up a new temporary instance.
Attach the broken disk (this will appear as /dev/sdb1) to the temporary instance
When the temporary instance is booted up, do the following:
In the temporary instance:
# Run fsck to fix any disk corruption issues
$ sudo fsck.ext4 -a /dev/sdb1
# Mount the disk from the broken vm
$ sudo mkdir /mnt/sdb
$ sudo mount /dev/sdb1 /mnt/sdb/ -t ext4
# Find out the UUID of the broken disk. In this case, the uuid of sdb1 is d9cae47b-328f-482a-a202-d0ba41926661
$ ls -alt /dev/disk/by-uuid/
lrwxrwxrwx. 1 root root 10 Jan 6 07:43 d9cae47b-328f-482a-a202-d0ba41926661 -> ../../sdb1
lrwxrwxrwx. 1 root root 10 Jan 6 05:39 a8cf6ab7-92fb-42c6-b95f-d437f94aaf98 -> ../../sda1
# Update the UUID in grub.cfg (if necessary)
$ sudo vim /mnt/sdb/boot/grub/grub.cfg
Note: This ^^^ is where I deviated from the support instructions.
Instead of modifying all the boot entries to set root=UUID=[uuid character string], I looked for all the entries that set root=/dev/sda1 and deleted them. I also deleted every entry that didn't set an initrd.img file. The top boot entry with correct parameters in my case ended up being 3.19.0-31-generic. But yours may be different.
# Flush all changes to disk
$ sudo sync
# Shut down the temporary instance
$ sudo shutdown -h now
Finally, detach the HDD from the temporary instance, and create a new instance based off of the fixed disk. It will hopefully boot.
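The last step can also be done from the command line; a sketch with hypothetical names (fixed-instance, my-fixed-disk):
gcloud compute instances create fixed-instance \
    --disk name=my-fixed-disk,boot=yes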
Assuming it does boot, you have a lot of work to do. If you have half as many unused kernels as me, then you might want to purge the unused ones (especially since some are likely missing a corresponding initrd.img file).
I used the second answer (the terminal-based one) in this askubuntu question to purge the other kernels.
Note: Make sure you don't purge the kernel you booted in with!
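As a sketch of the purge step (the kernel version here is just an example taken from above; check uname -r first and never purge the running kernel):
# show the kernel currently running
uname -r
# list installed kernel packages
dpkg -l 'linux-image-*' | grep ^ii
# purge one unused kernel, e.g. the broken 3.19.0-39-generic
sudo apt-get purge linux-image-3.19.0-39-generic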
How to handle these issues when I host some websites?
I'm not sure how you got into this situation, but it would be nice to have additional information (see my comment above) to be able to understand what triggered this issue.
How to recover my data from old disk?
Attach and mount the disk
Assuming you did not delete the original disk when you deleted the instance, you can simply mount this disk from another VM to read the data from it. To do this:
attach the disk to another VM instance, e.g.,
gcloud compute instances attach-disk $INSTANCE --disk $DISK
mount the disk:
sudo mkdir -p /mnt/disks/[MNT_DIR]
sudo mount [OPTIONS] /dev/disk/by-id/google-[DISK_NAME] /mnt/disks/[MNT_DIR]
Note: you'll need to substitute appropriate values for:
MNT_DIR: directory
OPTIONS: options appropriate for your disk and filesystem
DISK_NAME: the id of the disk after you attach it to the VM
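For example, with hypothetical values (a disk attached as my-data-disk with an ext4 filesystem, mounted read-only to be safe):
sudo mkdir -p /mnt/disks/data
sudo mount -o ro -t ext4 /dev/disk/by-id/google-my-data-disk /mnt/disks/data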
Unmounting and detaching the disk
When you are done using the disk, reverse the steps:
Note: Before you detach a non-root disk, unmount the disk first. Detaching a mounted disk might result in incomplete I/O operation and data corruption.
unmount the disk
sudo umount /dev/disk/by-id/google-[DISK_NAME]
detach the disk from the VM:
gcloud compute instances detach-disk $INSTANCE --device-name my-new-device
In my case, grub's (/boot/grub/grub.cfg) first menuentry (3.19.0-51-generic) was missing an initrd entry, and the system was unable to boot.
Upon further investigation, looking at dpkg for the specific kernel, it's marked as failed and unconfigured:
dpkg -l | grep 3.19.0-51-generic
iF linux-image-3.19.0-51-generic 3.19.0-51.58~14.04.1
iU linux-image-extra-3.19.0-51-generic 3.19.0-51.58~14.04.1
This all stemmed from the Ubuntu image supplied by Google having unattended-upgrades enabled. For some reason the initrd was killed when it was being built and something else came along and ran update-grub2.
unattended-upgrades-dpkg_2016-03-10_06:49:42.550403.log:update-initramfs: Generating /boot/initrd.img-3.19.0-51-generic
Killed
E: mkinitramfs failure cpio 141 xz -8 --check=crc32 137
unattended-upgrades-dpkg_2016-03-10_06:49:42.550403.log:update-initramfs: failed for /boot/initrd.img-3.19.0-51-generic with 1.
To work around the immediate problem, run:
dpkg --force-confold --configure -a
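Afterwards, it may be worth regenerating the missing initrd and the GRUB menu explicitly (a sketch; the kernel version is the one from the failed log above):
sudo update-initramfs -u -k 3.19.0-51-generic
sudo update-grub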
Although unattended-upgrades in theory is a great idea, having it enabled by default can have unattended consequences.
There are a few cases where the kernel fails to handle initrdless boot. Disable the GRUB_FORCE_PARTUUID option so that it boots with an initrd.

Problems regarding nagios.cfg with defined services and hosts (generated by Nagiosql)

NagiosQL-generated files cause problems during the preflight check, but everything seems to be okay.
/etc/nagios/nagios.cfg
....
## Hosts
cfg_dir=/etc/nagiosql/hosts/
cfg_file=/etc/nagiosql/hosttemplates.cfg
cfg_file=/etc/nagiosql/hostgroups.cfg
cfg_file=/etc/nagiosql/hostextinfo.cfg
cfg_file=/etc/nagiosql/hostescalations.cfg
cfg_file=/etc/nagiosql/hostdependencies.cfg
## Services
cfg_dir=/etc/nagiosql/services/
cfg_file=/etc/nagiosql/servicetemplates.cfg
cfg_file=/etc/nagiosql/servicegroups.cfg
cfg_file=/etc/nagiosql/serviceextinfo.cfg
cfg_file=/etc/nagiosql/serviceescalations.cfg
cfg_file=/etc/nagiosql/servicedependencies.cfg
...
nagios -v /etc/nagios/nagios.cfg
....
Running pre-flight check on configuration data...
Checking services...
Error: There are no services defined!
Checked 0 services.
Checking hosts...
Error: There are no hosts defined!
Checked 0 hosts.
The content seems okay to me
[root@xxx services]# cd /etc/nagiosql/services/
[root@xxx services]# ls -alh
total 20K
drwsr-sr-x 2 apache nagios 4.0K Aug 7 10:46 .
drwsr-sr-x 5 apache nagios 4.0K Aug 7 12:17 ..
-rw-r--r-- 1 apache nagios 2.3K Aug 7 10:46 localhost.cfg
-rw-r--r-- 1 apache nagios 2.2K Aug 7 10:46 www.google.com.cfg
-rw-r--r-- 1 apache nagios 1.1K Aug 7 10:46 www.yahoo.com.cfg
[root@xxx hosts]# ls -alh
total 16K
drwsr-sr-x 2 apache nagios 4.0K Aug 11 07:12 .
drwsr-sr-x 5 apache nagios 4.0K Aug 7 12:17 ..
-rw-r--r-- 1 apache nagios 800 Aug 11 07:12 GIT.cfg
-rw-r--r-- 1 apache nagios 948 Aug 11 07:12 psm01.cfg
Content also seems to be fine (generated by nagiosql):
[root@xxx hosts]# vi GIT.cfg
###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.2.0
# Date: 2015-08-11 07:12:54
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios QL will overwrite all manual settings during the next update
#
###############################################################################
define host {
host_name GIT
alias GIT Server
address 172.25.10.80
register 0
}
###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
Can somebody tell me what the solution to this problem is? I've already wasted 2 hours...
Try removing the final slash from the directory names in your cfg_dir definitions and see if that doesn't get it to recognize the cfg files in that directory.
For example,
Change:
cfg_dir=/etc/nagiosql/hosts/
...
cfg_dir=/etc/nagiosql/services/
To:
cfg_dir=/etc/nagiosql/hosts
...
cfg_dir=/etc/nagiosql/services
EDIT:
Okay, I think directory permissions may be causing the cfg_dir evaluations to fail. According to the ls -alh output you listed, your /etc/nagiosql/hosts/, /etc/nagiosql/services/, and /etc/nagiosql/ directories do not grant write permissions to the nagios group. Nagios will need to get a directory listing for those directories and will need group write permissions to do it.
To remedy:
chmod g+w /etc/nagiosql/hosts/
chmod g+w /etc/nagiosql/services/
Restart nagios service.
Also, you don't need to remove the slashes from the directory paths in the nagios cfg_dir configurations. Nagios will strip the trailing slash (/) for you, according to the code:
https://github.com/NagiosEnterprises/nagioscore/blob/eb8e83d5d05e572eb8c0d4d4764885c5427b4b69/xdata/xodtemplate.c#L327
/* process all files in a config directory */
else if(!strcmp(var, "xodtemplate_config_dir") || !strcmp(var, "cfg_dir")) {
    if(config_base_dir != NULL && val[0] != '/') {
        asprintf(&cfgfile, "%s/%s", config_base_dir, val);
    }
    else
        cfgfile = strdup(val);

    /* strip trailing / if necessary */
    if(cfgfile != NULL && cfgfile[strlen(cfgfile) - 1] == '/')
        cfgfile[strlen(cfgfile) - 1] = '\x0';

    /* process the config directory... */
    result = xodtemplate_process_config_dir(cfgfile, options);
    my_free(cfgfile);

    /* if there was an error processing the config file, break out of loop */
    if(result == ERROR)
        break;
}
EDIT #2: In the host definition you posted, your register value is set to 0. Try setting it to 1 instead. register 0 is used for templates that will be inherited from, but will not actually show up in the Nagios UI.
Change:
define host {
host_name GIT
alias GIT Server
address 172.25.10.80
register 0
}
To:
define host {
host_name GIT
alias GIT Server
address 172.25.10.80
register 1
}
Also please set register 1 for your service definitions as well.
Try adding executable permissions to your directories. Some programs and languages require +x permissions in order to actually open the directory.
If that doesn't work, temporarily set everything to 0777 permissions to see if the issue is permissions related at all.
You also have config problems even if you get that part working. Your host and service configs don't have a use directive in them, which points to a template that would have most of the default values. The register directive is implied as 1 unless you specifically set it to 0 for a template. See the object definitions docs if you need a reference: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html
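A sketch of what a host definition using a template could look like; this assumes a generic-host template like the one in the sample configs that ship with Nagios:
define host {
    use        generic-host    ; inherit default values from the template
    host_name  GIT
    alias      GIT Server
    address    172.25.10.80
    register   1               ; make the host show up in the UI
}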

Cassandra 2.0.7 to 2.1.2 sstable upgradesstables, compaction problems

We upgraded Cassandra (5+5 nodes) from 2.0.9 to 2.1.2 (binaries) and ran nodetool upgradesstables one-by-one (bash script). After this we observe some problems:
on every node we observe about 50 "Pending Tasks", on one of them more than 500; this has persisted for 5 days, since we started nodetool upgradesstables. Even though concurrent_compactors is set to 8, Cassandra never runs more than 3-4 at the same time. The node with more than 500 pending tasks has about 11k files in the column family directory...
we have 2 SSD disks, but during compaction there are at most 10MB/s reads and 5MB/s writes, even with compaction_throughput_mb_per_sec set to 32, 64, or 256
during upgradesstables, on some tables we got:
WARN [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 ColumnFamilyStore.java:2492 - Unable to cancel in-progress compactions for reco_active_items_v1. Perhaps there is an unusually large row in progress somewhere, or the system is simply overloaded.
INFO [RMI TCP Connection(100)-10.64.72.34] 2014-12-21 23:53:18,953 CompactionManager.java:247 - Aborting operation on reco_prod.reco_active_items_v1 after failing to interrupt other compaction operations
nodetool is failing with:
Aborted upgrading sstables for atleast one column family in keyspace reco_prod, check server logs for more information.
on some nodes nodetool upgradesstables finished successfully, but we can still see jb files in the column family directory.
nodetool upgradesstables on some nodes returns:
error: null
-- StackTrace --
java.lang.NullPointerException
at org.apache.cassandra.io.sstable.SSTableReader.cloneWithNewStart(SSTableReader.java:952)
at org.apache.cassandra.io.sstable.SSTableRewriter.moveStarts(SSTableRewriter.java:250)
at org.apache.cassandra.io.sstable.SSTableRewriter.switchWriter(SSTableRewriter.java:300)
at org.apache.cassandra.io.sstable.SSTableRewriter.abort(SSTableRewriter.java:186)
at org.apache.cassandra.db.compaction.CompactionTask.runWith(CompactionTask.java:204)
at org.apache.cassandra.io.util.DiskAwareRunnable.runMayThrow(DiskAwareRunnable.java:48)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:75)
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:59)
at org.apache.cassandra.db.compaction.CompactionManager$4.execute(CompactionManager.java:340)
at org.apache.cassandra.db.compaction.CompactionManager$2.call(CompactionManager.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This is our production env (24h), and we observe higher load on the nodes and higher read latency, even more than 1 sec.
Any advice...?
