We are experiencing an issue where the ActiveMQ (integrated with Camel) client connection is reset very frequently for some period, so the application that uses the JMS connection does not work properly during the downtime.
After that period, the connections are refreshed successfully. We have found no clue as to why the connections go down in between. ActiveMQ is running in single mode (one broker, no cluster).
I can see that all TCP sockets are up with the netstat command.
Please find the errors below.
2015-08-20 23:49:08,084 | WARN | al-01/10.224.240.109:61616#41545 | CachingConnectionFactory | 135 - org.springframework.jms - 3.0.7.RELEASE | Encountered a JMSException - resetting the underlying JMS Connection
javax.jms.JMSException: java.io.EOFException
at org.apache.activemq.util.JMSExceptionSupport.create(JMSExceptionSupport.java:49)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.ActiveMQConnection.onAsyncException(ActiveMQConnection.java:1949)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.ActiveMQConnection.onException(ActiveMQConnection.java:1966)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.ResponseCorrelator.onException(ResponseCorrelator.java:126)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.TransportFilter.onException(TransportFilter.java:101)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.WireFormatNegotiator.onException(WireFormatNegotiator.java:160)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.AbstractInactivityMonitor.onException(AbstractInactivityMonitor.java:295)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.TransportSupport.onException(TransportSupport.java:96)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.nio.NIOTransport.serviceRead(NIOTransport.java:98)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.nio.NIOTransport$1.onSelect(NIOTransport.java:69)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.nio.SelectorSelection.onSelect(SelectorSelection.java:94)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at org.apache.activemq.transport.nio.SelectorWorker$1.run(SelectorWorker.java:119)[177:org.apache.activemq.activemq-core:5.7.0.1-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)[:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)[:1.7.0_65]
at java.lang.Thread.run(Unknown Source)[:1.7.0_65]
Caused by: java.io.EOFException
Sometimes it refuses the client connection, as below:
2015-08-20 23:49:13,138 | WARN | er[MONITORING_ping_itg://itg:13] | faultJmsMessageListenerContainer | 135 - org.springframework.jms - 3.0.7.RELEASE | Could not refresh JMS Connection for destination 'temporary' - retrying in 5000 ms. Cause: Could not connect to broker URL: nio://msc-pcen-portal-01:61616. Reason: java.net.ConnectException: Connection refused
Could somebody please comment on the possible causes of these client connection resets? Our application relies completely on the reliability of the connection. If this is expected behavior with a JMS broker, can you list any solutions to the issue?
Thanks in advance
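Since the log shows both an EOFException on an established connection and a later "Connection refused" on port 61616, one way to narrow things down is to poll the broker port from the client host and keep timestamped results to compare with the broker's own log. The following is only a minimal sketch of such a probe; the function names, interval, and check count are made up for illustration and are not part of ActiveMQ or Spring.

```python
import socket
import time


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def watch(host, port, interval_s=5.0, checks=3):
    """Print one timestamped reachability line per check (hypothetical helper)."""
    for _ in range(checks):
        state = "open" if port_open(host, port) else "unreachable"
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {host}:{port} {state}")
        time.sleep(interval_s)
```

Calling watch("msc-pcen-portal-01", 61616) from the client machine would use the host and port that appear in the warning above. A run of "unreachable" lines during the outage window would suggest the broker's listener itself went away, rather than a problem on the client side.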
I set up a new Flink cluster (v1.15) in a Kubernetes cluster. This new cluster is set up in the same namespace in which an existing Flink cluster (v1.13) is running fine.
The job-manager of the new Flink cluster is in a CrashLoopBackOff state. It continuously prints the following set of messages, which includes a specific ERROR message:
2022-10-10 23:02:47,214 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,214 ERROR org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState [] - Authentication failed
2022-10-10 23:02:47,215 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established, initiating session, client: /100.98.125.116:57754, server: flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session establishment complete on server flink-zk-client-service/100.65.161.135:2181, sessionid = 0x381a609d51f0082, negotiated timeout = 4000
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: RECONNECTED
2022-10-10 23:02:47,216 INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,218 ERROR org.apache.flink.shaded.curator5.org.apache.curator.framework.recipes.leader.LeaderLatch [] - getChildren() failed. rc = -6 <============
2022-10-10 23:02:47,218 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Unable to read additional data from server sessionid 0x381a609d51f0082, likely server has closed socket, closing socket connection and attempting reconnect
It seems the error message indicates that a specific node of the new cluster in ZK either does not exist or does not have any children. But I could be off. I am using the same zookeeper for the v1.13 cluster, no issues with that cluster.
Content of zookeeper. (ClusterId is - dev-cl2):
ls /dev-cl2/dev-cl2
[leader]
ls /dev-cl2/dev-cl2/leader
[]
get /dev-cl2/dev-cl2/leader
// Nothing printed
Any help or pointers to troubleshooting this issue would be greatly appreciated. Thank you.
Update1:
I noticed the following in the 1.15 release notes.
A new multiple component leader election service was implemented that only runs a single leader election per Flink process. If this should cause any problems, then you can set high-availability.use-old-ha-services: true in the flink-conf.yaml to use the old high availability services.
As a test, I set high-availability.use-old-ha-services: true. It did not have any effect.
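On the "rc = -6" in the LeaderLatch error: that number is a ZooKeeper KeeperException code, and -6 is UNIMPLEMENTED, not NONODE (-101). So the server is probably not reporting a missing or empty znode. UNIMPLEMENTED generally means the server does not support the operation the client sent, which can happen when the client library is newer than the ZooKeeper server. Whether that applies here is an assumption that would need checking against the server version. A small lookup sketch for decoding such codes (the table is a subset, and describe_rc is a hypothetical helper name):

```python
# Subset of the org.apache.zookeeper.KeeperException.Code values.
ZK_RETURN_CODES = {
    0: "OK",
    -1: "SYSTEMERROR",
    -4: "CONNECTIONLOSS",
    -6: "UNIMPLEMENTED",
    -7: "OPERATIONTIMEOUT",
    -101: "NONODE",
    -102: "NOAUTH",
    -110: "NODEEXISTS",
    -112: "SESSIONEXPIRED",
    -115: "AUTHFAILED",
}


def describe_rc(rc):
    """Map a numeric ZooKeeper return code to its symbolic name."""
    return ZK_RETURN_CODES.get(rc, f"UNKNOWN({rc})")


print(describe_rc(-6))
```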
I have set up a Flink standalone cluster with one master and three slaves, all SUSE Linux machines. In the master dashboard http://flink-master:8081/ I can see 3 Task Managers and 3 task slots, as I have set taskmanager.numberOfTaskSlots: 1 in flink-conf.yaml on all of the slaves.
When I run a Flink built-in program, like examples/streaming/Iteration.jar, I often get this exception:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Caused by: java.net.ConnectException: Connection refused: ccr202/127.0.0.2:49651
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
... 6 more
It seems that the network causes the problem, but sometimes the Flink program finishes successfully. So what is the reason?
I also encounter this issue very frequently, especially when there are many TaskManagers. It happens when a TaskManager reads a remote partition over a Netty connection and the connection request times out. Of the few config options I tried, increasing "taskmanager.network.netty.server.numThreads" solved the issue.
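Separately from the thread setting, the trace in the question shows the remote task manager as ccr202/127.0.0.2. Every address in 127.0.0.0/8 is loopback, so a machine connecting to 127.0.0.2 reaches itself, not ccr202. If the hostname is mapped to a loopback address in the hosts file on the workers (an assumption, since the question does not show /etc/hosts), connections between TaskManagers on different machines could be refused, while jobs whose tasks happen to exchange data on one machine might still finish. A minimal sketch for checking this on each node, with hypothetical helper names:

```python
import ipaddress
import socket


def is_loopback(address):
    """Return True if the IP address lies in a loopback range such as 127.0.0.0/8."""
    return ipaddress.ip_address(address).is_loopback


def resolves_to_loopback(hostname):
    """Resolve hostname on this machine and report whether it maps to loopback."""
    return is_loopback(socket.gethostbyname(hostname))
```

Running resolves_to_loopback("ccr202") on each worker would show whether the name that Flink advertises points back at the local machine.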
I am getting the error below from code deployed on WAS:
org.springframework.messaging.MessageHandlingException:
error occurred in message handler [org.springframework.integration.aggregator.AggregatingMessageHandler#0];
nested exception is org.springframework.jdbc.CannotGetJdbcConnectionException:
Could not get JDBC Connection;
nested exception is com.ibm.websphere.ce.cm.ConnectionWaitTimeoutException:
Connection not available, Timed out waiting for 180000
Detailed Trace:
Caused by: org.springframework.jdbc.CannotGetJdbcConnectionException: Could not get JDBC Connection; nested exception is com.ibm.websphere.ce.cm.ConnectionWaitTimeoutException: Connection not available, Timed out waiting for 180000
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:80)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:630)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:695)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:727)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:752)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:762)
at org.springframework.integration.jdbc.JdbcMessageStore.getMessageGroup(JdbcMessageStore.java:431)
at org.springframework.integration.aggregator.AbstractCorrelatingMessageHandler.handleMessageInternal(AbstractCorrelatingMessageHandler.java:388)
at org.springframework.integration.handler.AbstractMessageHandler.handleMessage(AbstractMessageHandler.java:78)
... 157 more
Caused by: com.ibm.websphere.ce.cm.ConnectionWaitTimeoutException: Connection not available, Timed out waiting for 180000
at com.ibm.ws.rsadapter.AdapterUtil.toSQLException(AdapterUtil.java:1684)
at com.ibm.ws.rsadapter.jdbc.WSJdbcDataSource.getConnection(WSJdbcDataSource.java:686)
at com.ibm.ws.rsadapter.jdbc.WSJdbcDataSource.getConnection(WSJdbcDataSource.java:636)
at org.springframework.jdbc.datasource.DataSourceUtils.doGetConnection(DataSourceUtils.java:111)
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:77)
... 165 more
Caused by: com.ibm.websphere.ce.j2c.ConnectionWaitTimeoutException: Connection not available, Timed out waiting for 180000
at com.ibm.ejs.j2c.FreePool.createOrWaitForConnection(FreePool.java:1729)
at com.ibm.ejs.j2c.PoolManager.reserve(PoolManager.java:3329)
at com.ibm.ejs.j2c.PoolManager.reserve(PoolManager.java:2610)
at com.ibm.ejs.j2c.ConnectionManager.allocateMCWrapper(ConnectionManager.java:1500)
at com.ibm.ejs.j2c.ConnectionManager.allocateConnection(ConnectionManager.java:1012)
at com.ibm.ws.rsadapter.jdbc.WSJdbcDataSource.getConnection(WSJdbcDataSource.java:669)
... 168 more
Judging by the com.ibm.ejs.j2c.PoolManager frames in your logs, I'd recommend you contact WAS support. It looks like the connection pool is too small for your use case, especially given your clue:
It happens when i post too many requests on server.
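A rough way to judge whether a pool is too small is Little's law: the average number of connections in use is about the request rate multiplied by how long each request holds a connection. The sketch below only illustrates that arithmetic. The function name and the numbers are made up, and the real maximum pool size is configured on the WAS data source, not in application code.

```python
import math


def required_pool_size(requests_per_second, hold_ms):
    """Estimate concurrent connections needed: arrival rate times hold time."""
    return math.ceil(requests_per_second * hold_ms / 1000)


# Hypothetical load: 50 requests/s, each holding a connection for 200 ms.
print(required_pool_size(50, 200))  # prints 10
```

If the estimate for peak load is above the configured maximum, requests queue for a connection, and any that wait longer than the connection timeout (180000 in the trace) fail with ConnectionWaitTimeoutException.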
I'm using jmeter to stress test a GAE web service which uses CloudSQL and I'm getting intermittent communications link failure exceptions.
I've tried using direct connections and a connection pool, and I see exceptions in either scenario. The exceptions increase as the number of requests per second increase.
Note that we are using the highest tier of Cloud SQL, D32, and the tests are well under the maximum of 3200 connections.
Here's a stack trace for reference:
The last packet sent successfully to the server was 0 milliseconds ago. The driver has not received any packets from the server.
at sun.reflect.GeneratedConstructorAccessor48.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:33)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:1117)
at com.mysql.jdbc.MysqlIO.<init>(MysqlIO.java:350)
at com.mysql.jdbc.ConnectionImpl.coreConnect(ConnectionImpl.java:2413)
at com.mysql.jdbc.ConnectionImpl.connectOneTryOnly(ConnectionImpl.java:2450)
at com.mysql.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:2235)
at com.mysql.jdbc.ConnectionImpl.<init>(ConnectionImpl.java:818)
at com.mysql.jdbc.JDBC4Connection.<init>(JDBC4Connection.java:46)
at sun.reflect.GeneratedConstructorAccessor46.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:33)
at com.mysql.jdbc.Util.handleNewInstance(Util.java:411)
at com.mysql.jdbc.ConnectionImpl.getInstance(ConnectionImpl.java:404)
at com.mysql.jdbc.GoogleNonRegisteringDriver$JdbcWrapper.getInstance(GoogleNonRegisteringDriver.java:276)
at com.mysql.jdbc.GoogleNonRegisteringDriver.connect(GoogleNonRegisteringDriver.java:246)
at java.sql.DriverManager.getConnection(DriverManager.java:571)
at java.sql.DriverManager.getConnection(DriverManager.java:215)
Update: I changed the connection pool settings to maxActive = 5 and maxIdle = 5 and the intermittent communications link exceptions went away. Note that I've tried commons dbcp and tomcat dbcp. I'm now seeing the following exceptions in the logs:
Caused by: java.sql.SQLException: java.lang.SecurityException: Unable to access gatherPerformanceMetrics
Caused by: java.sql.SQLException: java.lang.SecurityException: Unable to access includeThreadDumpInDeadlockExceptions
Caused by: java.sql.SQLException: java.lang.SecurityException: Unable to access nullNamePatternMatchesAll
From https://cloud.google.com/appengine/docs/java/cloud-sql/#Java_Size_and_access_limits
"Each App Engine instance cannot have more than 12 concurrent connections to a Google Cloud SQL instance."
Can you tell us more about the test setup? How many requests is jmeter sending to App Engine, and how many connections does the app instance open for each of those requests?
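The quoted per-instance limit also fits the update in the question, where capping the pool at maxActive = 5 made the link failures go away. The idea of keeping concurrent connections under a fixed ceiling can be sketched with a counting semaphore. This is an illustration only: the class and method names are hypothetical, it is written in Python for brevity, and in the Java app the same idea would map to the pool's maxActive setting or to java.util.concurrent.Semaphore.

```python
import threading
from contextlib import contextmanager


class ConnectionLimiter:
    """Caps how many connections may be open at once in this process."""

    def __init__(self, limit=12):
        # 12 is the per-instance limit quoted from the documentation above.
        self._slots = threading.BoundedSemaphore(limit)

    def try_acquire(self, timeout=None):
        """Take a slot; return False if none frees up within timeout seconds."""
        return self._slots.acquire(timeout=timeout)

    def release(self):
        self._slots.release()

    @contextmanager
    def slot(self, timeout=None):
        """Hold a slot for the duration of a with-block."""
        if not self.try_acquire(timeout):
            raise TimeoutError("no free connection slot")
        try:
            yield
        finally:
            self.release()
```

A request handler would wrap its database work in "with limiter.slot(timeout=5):", so excess requests wait for a slot or fail fast without opening a thirteenth connection.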
For anyone wondering why they might be getting "com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure" on a connection:
Make sure your IP is allowed if you are calling from a test server!
I was testing at a friend's house, and this unhelpful error kept showing up.
I have set up 3 servers on Amazon EC2, each with the following ZooKeeper config:
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
server.1=server1address:2888:3888
server.2=server3address:2888:3888
server.3=server3address:2888:3888
I start zookeeper on each server, and after I start Solr on the servers, I get errors like this in Solr:
3766 [main] INFO org.apache.solr.common.cloud.ConnectionManager – Waiting for client to connect to ZooKeeper
3790 [main-SendThread(*serverAddress*:2181)] WARN org.apache.zookeeper.ClientCnxn – Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
This was apparently coming because Zookeeper wasn't running properly. What I then figured out was that zookeeper was producing this error:
2013-06-09 08:00:57,953 [myid:1] - INFO [ec2amazonaddress.com/ipaddress#amazon:QuorumCnxManager$Listener#493] - Received connection request /ipaddress:60855
2013-06-09 08:00:57,963 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager#368] - Cannot open channel to 3 at election address ec2amazonaddress/ipaddress#amazon:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
So the problem is with ZooKeeper. What I did was to start another server before the server I previously started first, and then it worked. However, after some restarts that didn't work anymore. In other words, it seems like the order of when you start the ZK server matters. I was able to see that some servers who were fired up first went into follower mode instead of leader mode right away, and maybe that's the reason. I have deleted and reinstalled my whole setup, but the problem was still there.
I have checked the ports and have killed all processes using ports 2181 and 2888/3888 before launching Zookeeper. What bothers me is that this has worked with the same setup earlier.
I hope some of you have experience with this problem. Any suggestion related to not being able to connect to the ZK servers is also welcome.
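One detail worth checking in the config as posted: server.2 and server.3 both point at server3address. If that is not just a typo made while anonymizing the post, the second machine would never be contacted for quorum traffic, which could explain election connections being refused depending on start order. A minimal sketch that flags hosts listed under more than one server id (the function name is hypothetical, and it only parses the server.N=host:port:port form shown above):

```python
import re
from collections import defaultdict


def find_duplicate_hosts(config_text):
    """Return {host: [server ids]} for hosts that appear under several ids."""
    hosts = defaultdict(list)
    for line in config_text.splitlines():
        match = re.match(r"\s*server\.(\d+)\s*=\s*([^:\s]+):", line)
        if match:
            hosts[match.group(2)].append(int(match.group(1)))
    return {host: ids for host, ids in hosts.items() if len(ids) > 1}
```

Run against the three server lines from the question, this reports server3address under ids 2 and 3.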