ZooKeeper timeouts without errors in ZooKeeper (Solr)

We are facing an issue with Solr/ZooKeeper where ZooKeeper times out after 10000 ms. The error is below.
SolrException: java.util.concurrent.TimeoutException: Could not connect to ZooKeeper <server1>:9181,<server2>:9182,<server2>:9183 within 10000 ms.
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:184)
at org.apache.solr.common.cloud.SolrZkClient.<init>(SolrZkClient.java:121)
We are not getting any errors in the ZooKeeper logs, except the ones below:
2018-12-19 04:35:22,305 [myid:2] - INFO [SessionTracker:ZooKeeperServer#354] - Expiring session 0x200830234de3127, timeout of 10000ms exceeded
2018-12-19 05:35:38,304 [myid:2] - INFO [SessionTracker:ZooKeeperServer#354] - Expiring session 0x200b4f912730086, timeout of 10000ms exceeded
During the issue the thread count goes high, and we noticed the following in the WebLogic server:
Name: Connection evictor
State: TIMED_WAITING
Total blocked: 0  Total waited: 1
Stack trace:
java.lang.Thread.sleep(Native Method)
org.apache.http.impl.client.IdleConnectionEvictor$1.run(IdleConnectionEvictor.java:66)
java.lang.Thread.run(Thread.java:748)
What could be going wrong here?

In my experience, ZK timeouts have almost always been due to something on the Solr node, rather than a problem in ZK.
You don't provide all the timestamps, but the theory is that:
Solr fails to send the heartbeat for some reason
ZK assumes the client has gone away and closes the connection
Solr tries to use the connection that ZK closed
So why might the Solr node fail to send the heartbeat? The node could simply be overloaded (is the thread spike a cause, or a symptom?), or it could be working through a very long GC pause.
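If GC is the suspect, a first step is to turn on GC logging on the Solr JVM and give the session a little more headroom. A minimal sketch, assuming the bin/solr start scripts and a Java 8 JVM (these variables are from solr.in.sh; adjust names, paths and timeouts to your install):
# Raise the ZK session timeout so short pauses no longer expire the session
ZK_CLIENT_TIMEOUT="30000"
# Log GC activity so pauses can be correlated with the expiries ZK reports
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/solr/logs/solr_gc.log"
If the GC log shows pauses approaching 10 seconds around the expiry timestamps, the real fix is tuning the heap (or removing the overload); raising the timeout only buys headroom.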

Related

Apache2: how to log rejected connections and client timeouts

I am doing some load testing on a service run with Apache2, and my load testing tool has a default timeout of 30 seconds. When I run the tool for a minute at 1 request per second, it reports that 40 requests succeeded with a 200 OK response and 20 requests were cancelled because the client timeout was exceeded while awaiting headers.
Now, I was trying to spot this on the server side. I can't see the timeouts logged in either the Apache access logs or the gunicorn access logs. Note that I am interested in connections that weren't accepted as well as ones that were accepted and then timed out.
I have some experience working on similar services on Windows. The http.sys error logs would show connection dropped errors and we would know if our server was dropping connections.
When a client times out, all the server knows is that the client has aborted the connection. In mod_log_config, the %X format specifier logs the status of the client connection after the request has completed, which is exactly what you want to know in this case.
Configure your logs to use %X, and look for the X character in the log lines.
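As a rough sketch (standard mod_log_config directives; adapt the format and log path to your setup):
LogFormat "%h %l %u %t \"%r\" %>s %b %X" combined_connstatus
CustomLog /var/log/apache2/access_connstatus.log combined_connstatus
With %X, each line ends with X (connection aborted before the response completed), + (connection may be kept alive after the response), or - (connection closed after the response).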
Bonus: I even found the discussion about this feature in Apache's dev forum, from 20 years ago.
Update:
Regarding refused connections, these cannot be logged by Apache. Connection refusal is done by the kernel, in the TCP stack, not by Apache. The closest Apache-only solution I can think of is keeping track of the number of open connections (using mod_status): if it reaches the maximum, you know you might be refusing connections. Otherwise, you'd need to set up some monitoring solution to track TCP resets sent by the kernel.
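If you go the mod_status route, a minimal sketch (Apache 2.4 syntax; 2.2 uses Order/Allow instead of Require):
ExtendedStatus On
<Location "/server-status">
    SetHandler server-status
    Require local
</Location>
Then watch the busy worker count against MaxRequestWorkers (MaxClients on 2.2) to see whether you are hitting the limit.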

Session timeout with apache web server and weblogic cluster. JSESSIONID in response is not the same as the one in request

We have an Apache web server (version httpd-2.2.22-win32-x86-openssl-0.9.8t) in front of a WebLogic (version 10.3.2) cluster with 3 nodes. In our load testing, we get session timeout errors in some cases (less than 1%). This was happening even though we have -1 for the session timeout in the web.xml files of the WebLogic nodes. After days of debugging, we realized that in some cases the JSESSIONID sent in the request is not honored by the response. Fiddler traces show that the response has a Set-Cookie: JSESSIONID header whose value is different from the JSESSIONID sent in the request, and we get the session expiry page immediately. As already mentioned, this happens only in some rare cases.
When using WebLogicCluster, requests have session affinity, so they go to the same node where the initial contact was made. The issue turned out to be that at high load the nodes were not responding, so the requests went to the other nodes. This is the default behavior with WebLogicCluster. Since we do not have session replication and failover enabled, any request that goes to a secondary node gives us a session timeout error.
One solution would have been to enable session replication and failover in WebLogic, but we did not want that, as the impact was high.
These were the configuration changes that fixed the issue:
In httpd.conf
ConnectTimeoutSecs 50 (default is 10)
ConnectRetrySecs 5 (default is 2)
WLSocketTimeoutSecs 10 (default is 2)
WLIOTimeoutSecs 18000 (default is 300)
Idempotent OFF (default is ON)
The first two changes (ConnectTimeoutSecs and ConnectRetrySecs) mean that the plugin retries 10 times (50/5) instead of the default 5 (10/2).
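For context, these directives sit in the WebLogic plugin section of httpd.conf; a sketch with hypothetical node names (your WebLogicCluster list will differ):
<IfModule mod_weblogic.c>
    WebLogicCluster node1:7001,node2:7001,node3:7001
    ConnectTimeoutSecs 50
    ConnectRetrySecs 5
    WLSocketTimeoutSecs 10
    WLIOTimeoutSecs 18000
    Idempotent OFF
</IfModule>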
In the WebLogic nodes:
domain --> Environment --> Servers --> click on the required server --> Tuning --> Accept Backlog: the default value is 300; we made it 375.
Then restart the WebLogic nodes and Apache.
For more details refer to
http://docs.oracle.com/cd/E13222_01/wls/docs81/plugins/plugin_params.html
http://docs.oracle.com/cd/E13222_01/wls/docs81/plugins/apache.html (see the diagram there)

Hanging ActiveMQ Transport and Connection threads

I've got a web service deployed on Apache ServiceMix which uses Apache Camel to invoke an ActiveMQ-driven route using code similar to the following:
context.createProducerTemplate().sendBody("activemq:startComplex", xml);
The invocation works fine, but after some time the file descriptor limit on my Linux machine gets hit. The resources are eaten up by a whole bunch (a few thousand) of ActiveMQ threads. In the JMX console I can see a lot of threads similar to the following:
Name: ActiveMQ Transport: tcp://localhost/127.0.0.1:61616
State: RUNNABLE
Total blocked: 0 Total waited: 0
Stack trace:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:129)
org.apache.activemq.transport.tcp.TcpBufferedInputStream.fill(TcpBufferedInputStream.java:50)
org.apache.activemq.transport.tcp.TcpTransport$2.fill(TcpTransport.java:589)
org.apache.activemq.transport.tcp.TcpBufferedInputStream.read(TcpBufferedInputStream.java:58)
org.apache.activemq.transport.tcp.TcpTransport$2.read(TcpTransport.java:574)
java.io.DataInputStream.readInt(DataInputStream.java:370)
org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:275)
org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:222)
org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:214)
org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:197)
java.lang.Thread.run(Thread.java:662)
and
Name: ActiveMQ Transport: tcp:///127.0.0.1:46420
State: RUNNABLE
Total blocked: 0 Total waited: 2
Stack trace:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:129)
org.apache.activemq.transport.tcp.TcpBufferedInputStream.fill(TcpBufferedInputStream.java:50)
org.apache.activemq.transport.tcp.TcpTransport$2.fill(TcpTransport.java:589)
org.apache.activemq.transport.tcp.TcpBufferedInputStream.read(TcpBufferedInputStream.java:58)
org.apache.activemq.transport.tcp.TcpTransport$2.read(TcpTransport.java:574)
java.io.DataInputStream.readInt(DataInputStream.java:370)
org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:275)
org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:222)
org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:214)
org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:197)
java.lang.Thread.run(Thread.java:662)
Any ideas how to get rid of the hanging threads?
See this FAQ:
http://camel.apache.org/why-does-camel-use-too-many-threads-with-producertemplate.html
You should not create a new producer template on every message send, and if you do, then remember to close it after use.
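A minimal sketch of that pattern against the Camel 2.x API (the wrapper class name is just for illustration; the endpoint is the one from the question):
import org.apache.camel.CamelContext;
import org.apache.camel.ProducerTemplate;

public class StartComplexSender {

    private final ProducerTemplate template;

    public StartComplexSender(CamelContext context) {
        // Create the template once per application and reuse it; each template
        // keeps its own cache of producers (and their ActiveMQ connections).
        this.template = context.createProducerTemplate();
    }

    public void send(String xml) {
        template.sendBody("activemq:startComplex", xml);
    }

    public void shutdown() throws Exception {
        // If you do create short-lived templates, stop() them when done so their
        // cached producers and transport threads are released.
        template.stop();
    }
}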
I managed to get rid of the leaking threads issue by dropping all use of ProducerTemplate and ConsumerTemplate.
I am now using standard JMS APIs to send and receive messages from ActiveMQ.
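For completeness, a sketch of what sending with the plain JMS API can look like (the broker URL and queue name are assumptions based on the route above):
import javax.jms.Connection;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class JmsStartComplexSender {

    public static void send(String xml) throws JMSException {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // assumed broker URL
        Connection connection = factory.createConnection();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(session.createQueue("startComplex"));
            producer.send(session.createTextMessage(xml));
        } finally {
            connection.close(); // also closes the session and producer, freeing the transport thread
        }
    }
}
In a real application you would typically pool connections (e.g. with ActiveMQ's PooledConnectionFactory) rather than open one per send.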

Client requested session xx that was terminated due to FORWARDING_TO_NODE_FAILED

After a few hours of working with the new Selenium 2.20, I get this error:
WARNING: Client requested session 1331671421031 that was terminated
due to FORWARDING_TO_NODE_FAILED
14.3.2012 5:46:55 org.openqa.grid.internal.ActiveTestSessions getExistingSession
What can be wrong? I never got this message with version 2.19 of the Selenium server.
There are also no other messages, just this message all the time in the console.
Per the Selenium wiki:
FORWARDING_TO_NODE_FAILED - The hub was unable to forward to the node. Out of memory errors/node stability issues or network problems.
Are you sure you didn't have a network-related issue?

Tomcat 6 response writing

I am observing strange behavior from my Tomcat server; it seems like Tomcat is not writing the response to the client fast enough. Here is what I am seeing:
When firing around 200 requests at the same time at my Tomcat server, my application logs show that my servlet's doGet() finishes processing each request in about 500 ms. However, at the client side the average response time is about 30 seconds (which means the client starts seeing the response from Tomcat after 30 seconds)!
Does anyone have any idea why there is such a long delay between the end of my servlet's processing time and the time when the client receives the response?
My server is hosted on Rackspace VM.
Found the culprit. I observed that the hosting server had abnormally high CPU usage even for only a few requests, so I attached JConsole to Tomcat and found that all my worker threads had high blocked counts and were constantly in the blocked state. Looking at the stack traces, the locking happened during JAXBContext instantiation. Digging further, the application was creating a JAXBContext, which is relatively expensive, for each request.
So in summary, the problem was caused by JAXBContext instantiation on every request. The solution was to ensure the JAXBContext is created once per application.
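A minimal sketch of that fix (the holder class and the Order type are hypothetical; the point is the single shared JAXBContext):
import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;

public final class JaxbHolder {

    // JAXBContext is thread-safe and expensive to build, so create it once
    // per application and share it across all requests.
    private static final JAXBContext CONTEXT;

    static {
        try {
            CONTEXT = JAXBContext.newInstance(Order.class); // Order is a placeholder bound class
        } catch (JAXBException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    private JaxbHolder() {
    }

    public static Order parse(java.io.InputStream in) throws JAXBException {
        // Marshaller/Unmarshaller are NOT thread-safe, but they are cheap: create one per call.
        Unmarshaller unmarshaller = CONTEXT.createUnmarshaller();
        return (Order) unmarshaller.unmarshal(in);
    }
}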
