Increasing Solr5 time out from 30 seconds while starting solr - solr

Many times when starting Solr I see the message below, and after that Solr is not reachable.
debraj@boutique3:~/solr5$ sudo bin/solr start -p 8789
Waiting to see Solr listening on port 8789 [-] Still not seeing Solr listening on 8789 after 30 seconds!
I have two cores in my local set-up. I am guessing this happens because one of the cores is fairly big, so Solr times out while loading it. If I take one of the cores out of Solr, everything works fine.
Can someone let me know how I can increase this timeout from the default 30 seconds?
I am using Solr 5.2.1 on Debian 7.

Usually this could be related to startup problems, but if you are running solr on a slow machine, 30 seconds may not be enough for it to start.
In that case you may try this (I'm using Solr 5.5.0):
Windows (tested, working): in bin/solr.cmd file, look for the parameter
-maxWaitSecs 30
a few lines below "REM now wait to see Solr come online ..." and replace 30 with a number that meets your needs (e.g. 300 seconds = 5 minutes).
Others (not tested): in the bin/solr file, search for the following code:
if [ $loops -lt 6 ]; then
  sleep 5
  loops=$[$loops+1]
else
  echo -e "Still not seeing Solr listening on $SOLR_PORT after 30 seconds!"
  tail -30 "$SOLR_LOGS_DIR/solr.log"
  exit # subshell!
fi
Increase the number of waiting loop cycles from 6 to whatever meets your needs (e.g. 60 cycles * 5 sleep seconds = 300 seconds = 5 minutes). You should also change the number of seconds in the message just below, to keep it consistent.
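The arithmetic above (cycles * sleep seconds = total wait) can be made explicit by passing the timeout as a parameter. This is only a minimal sketch of the same polling pattern, not code from bin/solr: the helper name wait_until and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of bin/solr: run a check command every
# $interval seconds until it succeeds or at least $timeout seconds have
# been spent sleeping.
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local waited=0
  while ! "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Still failing after ${waited} seconds: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}
```

For example, wait_until 300 5 nc -z localhost 8789 would poll the port every 5 seconds for up to 5 minutes, assuming a netcat that supports -z is installed.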

Related

Rebalancing rate when new node is added

When a new node is added, we see that it is starting to receive new tablets (in the http://:7000/tablet-servers page) and the system is rebalancing. But the default rate seems low. Are there any knobs to determine this rate?
The rebalance in YugaByte DB is rate limited.
One of the parameters that governs this behavior is the yb-tserver gflag remote_bootstrap_rate_limit_bytes_per_sec which defaults to 256MB/sec and is the maximum transmission rate (inbound + outbound) related to rebalance that any one server (yb-tserver) may do.
To inspect the current setting on a yb-tserver you can try this:
$ curl -s 10.150.0.20:9000/varz | grep remote_bootstrap_rate
--remote_bootstrap_rate_limit_bytes_per_sec=268435456
This particular parameter can also be changed on the fly, without a yb-tserver restart. For example, to set the rate to 512MB/sec:
bin/yb-ts-cli --server_address=$TSERVER_IP:9100 set_flag --force remote_bootstrap_rate_limit_bytes_per_sec 536870912
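The flag takes its value in bytes per second, so the MB/sec figures have to be converted first. A small sketch of that conversion, assuming 1 MB means 1024 * 1024 bytes (which matches the 268435456 default shown above); the function name mb_to_bytes is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a rate in MB/sec to bytes/sec,
# taking 1 MB = 1024 * 1024 bytes.
mb_to_bytes() {
  local mb=$1
  echo $((mb * 1024 * 1024))
}

mb_to_bytes 256   # prints 268435456 (the default)
mb_to_bytes 512   # prints 536870912
```

The result can be substituted into the set_flag command, e.g. as "$(mb_to_bytes 512)" in place of the literal number.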
A second aspect of this is the cluster-wide global settings for how many tablet rebalances can happen simultaneously in the system. These are governed by a few yb-master gflags, for example:
$ bin/yb-ts-cli --server_address=$MASTER_IP:7100 set_flag --force load_balancer_max_concurrent_adds 3
$ bin/yb-ts-cli --server_address=$MASTER_IP:7100 set_flag --force load_balancer_max_over_replicated_tablets 3
$ bin/yb-ts-cli --server_address=$MASTER_IP:7100 set_flag --force load_balancer_max_concurrent_tablet_remote_bootstraps 3

Job distribution between nodes on HPC, instead of 1 CPU cores

I am using PBS on an HPC cluster to submit serially written C codes. Suppose I have to run 5 codes in 5 different directories. When I select 1 node and 5 cores (select=1:ncpus=5) and submit with ./submit &, it forks and runs all 5 jobs. But when I choose 5 nodes with 1 core each (select=5:ncpus=1) and submit with ./submit &, only 1 core of the first node runs all five jobs, the remaining 4 stay idle, and the speed drops to 1/5.
My question is: is it possible to fork the jobs across the nodes as well?
I ask because on this HPC a request for select=1:ncpus=24 waits in the queue, whereas select=4:ncpus=6 runs.
Thanks.
You should consider using job arrays (using the option #PBS -t 1-5) with 1 node and 1 CPU each. Then 5 independent jobs will start, and your job will wait less in the queue.
Within your script you can use the environment variable PBS_ARRAYID to identify the task, and use it to set the appropriate directory and start the appropriate C code. Something like this:
#!/bin/bash -l
#PBS -N yourjobname
#PBS -q yourqueue
#PBS -l nodes=1:ppn=1
#PBS -t 1-5
./myprog-${PBS_ARRAYID}.c
This script will run 5 jobs, each of which runs a program named myprog-*.c, where * is a number between 1 and 5.
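Since the question has one code per directory, the array ID can also be mapped to a directory. The following is a minimal sketch under assumptions: the directory names caseA to caseE and the function task_dir are hypothetical, and PBS_O_WORKDIR is taken to be the directory the job was submitted from.

```shell
#!/usr/bin/env bash
# Hypothetical directory names, one per array task (IDs 1..5).
dirs=(caseA caseB caseC caseD caseE)

# Print the directory for a given array ID; fail if the ID is out of range.
task_dir() {
  local id=$1
  if (( id < 1 || id > ${#dirs[@]} )); then
    echo "array id out of range: $id" >&2
    return 1
  fi
  printf '%s\n' "${dirs[id - 1]}"
}

# Inside the job script one could then write, for example:
#   cd "$PBS_O_WORKDIR/$(task_dir "$PBS_ARRAYID")" && ./myprog
```

Keeping the mapping in one array means the #PBS -t range and the list of directories can be checked against each other at a glance.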

Working with ping localhost in Batch

We use ping localhost -n 2 >nul to delay the commands that follow it.
We can change 2 to the number of seconds needed.
How can I control this in a much broader way? I tried using 1.5 instead of 2 and it didn't work.
Is there any code by which we can change the unit of time?
EDIT: Instead of ping localhost -n 2 >nul. I'm using TIMEOUT 1 >nul.
The command timeout is the best choice for waiting a specific time in a batch file designed to run on Windows 7 and later versions of Windows. The user can break the timeout with any key unless /NOBREAK is specified as a parameter. It also shows a seconds countdown that tells the user how to break the timeout. However, it supports only timeout values in seconds, not in milliseconds.
The command sleep could also be used on Windows XP and later versions of Windows if the Windows 2003 resource kit is available and this small executable is copied to all computers running the batch file. But this executable is deprecated, having been replaced by TIMEOUT, and is not installed by default on any Windows computer.
A good choice for all Windows versions is the command ping, pinging either the loopback adapter or an unreachable IP address, with appropriate values of the options -n and -w for the delay.
The IP address of the loopback adapter of local machine is 127.0.0.1, see Wikipedia articles about Reserved IP addresses. localhost is just an RFC defined alias for 127.0.0.1 defined on Windows XP and former Windows versions in file %SystemRoot%\system32\drivers\etc\hosts and is defined built-in on Windows Vista and later Windows versions.
The first ping of 127.0.0.1 is always immediately successful. Therefore using command PING with -n 1 as option gives just a delay of approximately a millisecond in total.
For that reason, using PING as a delay by pinging 127.0.0.1 requires a value greater than 1 for option -n (the number of echo requests to send). After a successful request, PING waits about 1 second before making the next request.
So for a delay of 5 seconds the following command line is necessary with 6 echo requests:
%SystemRoot%\System32\ping.exe 127.0.0.1 -n 6 >nul
Note 1: Windows is not a real-time operating system and for that reason the time is not 100% accurate, but should be good enough for a batch file.
The option -w defines in milliseconds how long ping (Microsoft documentation) waits for an echo on the request. It does not define the time between two successful requests. Therefore this option can't be used to fine tune the delay on pinging the IP address 127.0.0.1 as this request is successful in less than 1 millisecond and value of option -w does not matter.
So for a delay in milliseconds instead of seconds, it is necessary to ping an IP address which is definitely, or at least most likely, not reachable and which is not routed via networks because it is a private network address according to RFC 1918.
An example is:
%SystemRoot%\System32\ping.exe 192.168.255.253 -n 1 -w 1500
The IPv4 address range from 192.168.0.0 to 192.168.255.255 is for private networks. The highest address 192.168.255.255 in this network is the broadcast address and is not used for devices. It is common to configure a router with the local area network broadcast address minus 1, which means 192.168.255.254 could be assigned to a router if the current computer is part of this private network. Other devices in a LAN are usually assigned IPv4 addresses from the lowest address plus 1 upwards. Therefore, for the IPv4 network 192.168.0.0/16, the IP address 192.168.255.253 is most likely not assigned to any device that would respond to the echo request of PING.
Well, the milliseconds delay is not very accurate. But is it really important on execution of a batch file to wait exactly 1500 ms?
Note 2: This approach does not work if the computer on which the batch file is running is currently not connected to any network. Without any network connection each echo request is always immediately terminated and PING outputs for each echo request the error message:
PING: transmit failed. General failure.
The general failure is no network (connection) present at all and therefore only echo requests to local loopback adapter work.
Unfortunately, @thx1138v2's solution only delays 0.04 seconds on my machine. Therefore, I've modified his solution to make it more accurate.
ping 1.1.1.1 -n 1 -w 1500 >nul
1500 stands for 1500 milliseconds, which is 1.5 seconds.
ping is inaccurate for short delays; see this table:
Milliseconds In Code | Actual Waited Time
1500 | 1.24 seconds - 1240 milliseconds
1600 | 1.34 seconds - 1360 milliseconds
1700 | 1.52 seconds - 1520 milliseconds
As you can see, specifying 1700 ms gives an actual wait much closer to 1.5 seconds than specifying 1500 ms does, so you may need to add some extra milliseconds.
Note: ping only supports delays of more than 99 milliseconds.
-n is the (n)umber of times to ping, not the amount of time to wait. You can't ping 1.5 times.
-w is the time to (w)ait on each ping, in milliseconds. A 1.5-second pause would be:
ping -n 3 -w 500
If there is a web site set up on the machine running the batch file, the ping will find it as localhost and the timeout will not apply. The timeout only applies to failed requests. It is better to ping 0.0.0.1 for a delay:
ping -n 3 -w 500 0.0.0.1

How to ignore timeouts in ab (apache bench)?

I run benchmarks with Apache Bench (ab) on a web service. I know that 1-2 requests from the test will time out during measurement (it's a web framework issue). When a timeout occurs, ab quits with the message apr_pollset_poll: The timeout specified has expired (70007) and does not show results. I want to get measurement results that ignore these timed-out requests (or count them too, using the timeout value as the response time). Is this possible with ab?
EDIT: The command I use is
ab -n 1000 -c 10 http://localhost:80
I looked into the ab source, and from what I saw it's impossible to ignore these errors. Maybe there is a fork which implements such a feature?
The default timeout is 30 seconds. You can change this with -s:
ab -s 9999 -n 1000 -c 10 http://localhost:80

libssh2 - channel read is hanging

I'm currently developing a remote job scheduler in Perl.
It has to connect via ssh to x servers and execute already defined jobs/job groups.
I use Net::SSH2, which is built upon libssh2.
My program usually works fine with about 400-500 servers, but when I try to run the basic uptime command on 1000 servers, one or more of my threads hangs and either never finishes or finishes about 30 minutes later.
It's random : sometimes it finishes on time, sometimes not.
I tracked the problem as coming from this Net::SSH2 command : $in .= $buf while $chan->read( $buf, 10240 );
Here is the full code of the connection :
my $chan = $this->{netssh2}->channel() or die $!;
$chan->blocking(1);
$chan->exec($command);
my ($in,$err,$buf,$buf_err);
$in .= $buf while $chan->read( $buf, 10240 );
$err .= $buf_err while $chan->read( $buf_err, 10240, 1 );
$chan->send_eof;
1 while !$chan->eof;
$chan->wait_closed;
I then downloaded a Net::SSH2 source package and modified the C-perl linking (xs) file.
It showed me that the problem comes from this line :
count = libssh2_channel_read_ex(ch->channel, XLATEXT, pv_buffer, size);
This command comes with the libssh2 library : http://www.libssh2.org/libssh2_channel_read_ex.html
Sometimes (about 1 in 1000 times) the program enters this read and never leaves. The affected servers are different most of the time.
Do you have any idea what I should be looking for or checking?
I've been working on this for a few days, and I'd very much like some outside advice :)

Resources