Process using zookeeper C client gets disconnected on SIGTERM [closed] - c

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 6 years ago.
We are using the Apache Zookeeper Client C bindings in our application. Client library version is 3.5.1. When the Zookeeper connection gets disconnected, the application is configured to exit with error code 116.
Systemd is being used to automate starting/stopping the application. The unit file does not override the defaults for KillMode or KillSignal, so systemd sends SIGTERM to the application when it is stopped.
When the process is stopped with systemctl stop, the Zookeeper client threads appear to attempt to reconnect to Zookeeper:
2016-04-12 22:34:45,799:4506(0xf14f7b40):ZOO_ERROR#handle_socket_error_msg#2363: Socket [128.0.0.4:61758] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2016-04-12 22:34:45,799:4506(0xf14f7b40):ZOO_INFO#check_events#2345: initiated connection to server [128.0.0.4:61758]
Apr 12 22:34:45 main thread: zookeeperWatcher: event type ZOO_SESSION_EVENT state ZOO_CONNECTING_STATE path
2016-04-12 22:34:45,801:4506(0xf14f7b40):ZOO_INFO#check_events#2397: session establishment complete on server [128.0.0.4:61758], sessionId=0x40000015b8d0077, negotiated timeout=20000
2016-04-12 22:34:46,476:4506(0xf14f7b40):ZOO_WARN#zookeeper_interest#2191: Delaying connection after exhaustively trying all servers [128.0.0.4:61758]
2016-04-12 22:34:46,810:4506(0xf14f7b40):ZOO_INFO#check_events#2345: initiated connection to server [128.0.0.4:61758]
2016-04-12 22:34:46,811:4506(0xf14f7b40):ZOO_ERROR#handle_socket_error_msg#2382: Socket [128.0.0.4:61758] zk retcode=-112, errno=116(Stale file handle): sessionId=0x40000015b8d0077 h
As a result, the process exits with an error code. Systemd sees the failure code on exit and does not attempt to restart the application. Does anyone know why the client is getting disconnected?
I am aware that I can work around this by setting SuccessExitStatus=116 in the unit file, but I don't want to mask out genuine errors. I have tried registering a signal handler for SIGTERM and closing the Zookeeper client in the handler. But the handler code never seems to get hit when I issue systemctl stop.
EDIT: The handler wasn't getting called because I had made it asynchronous - it didn't execute immediately upon receiving signal. OTOH the process exits immediately upon Zookeeper disconnect.

What happens when you load the handler for SIGTERM and issue systemctl stop?
If nothing occurs then you may have a mask blocking the signal (I guess not).
If the application keeps exiting with the same error code then I would suggest you make sure that the signal handler is being loaded correctly.

This is working as expected. It is the application writer's responsibility to specify how to shut the service down gracefully. If you don't want to use the default, which sends SIGTERM, you can use ExecStop in the unit file to define your own stop command:
ExecStart=/usr/bin/app
ExecStop=/usr/bin/app -stop
For details see docs at
https://www.freedesktop.org/software/systemd/man/systemd.service.html#ExecStop=

The issue is unrelated: someone was running a script that was killing the connection. Thank you all for your help!

Related

libuv - How to test for a disconnection of a listening socket from client (relay application)

My application sporadically modifies and relays messages that it receives to a listener server daemon (all using unix domain sockets, so uv_pipe_t).
(Workflow that has me stumped) When the first message has to be relayed, it calls uv_try_write() in the uv_read_done callback function (where it is reading on a listening socket of its own).
If the listening daemon is already up and running, this works perfectly and the message is relayed.
If the listening daemon is not yet up:
uv_try_write() fails and I check the status, which is negative (EAGAIN), so I try a connect (uv_pipe_connect). After this I call uv_try_write() again.
Since the connect fails (ENOENT), I log an error and give up.
I now start the listening daemon up.
The uv_try_write() again fails on the first message, despite the connect() (I presume because the connect is only made in the next loop iteration).
The second write onwards works fine and as expected.
I kill the listening daemon.
On the next write, the app receives a SIGPIPE (I have blocked this with sigaction and sigprocmask).
I restart the listening daemon.
This time the connect() fails with an EISCONN error (which I presume means that the handle I used in the first connect is still live and needs to be closed). However, since I cannot detect when the connection was closed by the listener daemon the last time, I cannot know when to close the handle.
Questions about best practice
Is this the best way to design the relay app? Perhaps not, because it is not very resilient (it drops messages on reconnection), and I do not want to hack around this without ensuring I am following proper libuv practices.
If it is, are any of the following options worthy?
some periodic timer setup heartbeating with the peers
a uv_check handle that checks the connection status at every loop iteration somehow. If so, how do I check the connection status? uv_is_writable always returns 0, even on a connected socket. Same with uv_is_active
uv_try_write() from the on_connect callback function to send the first message that is getting dropped
Thanks very much in advance for the help!
You can call uv_write after you call uv_pipe_connect, and the write will be queued. Writes will happen after the connection succeeds, or fail with UV_ECANCELED if the connection failed.
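On the detection part of the question, the kill/restart steps can be reproduced without libuv. The sketch below uses plain POSIX sockets on Linux (MSG_NOSIGNAL is assumed to be available) and hypothetical helper names. With SIGPIPE suppressed, a write to a stream socket whose peer has closed fails with EPIPE. libuv surfaces the same conditions as UV_EPIPE in the write callback and UV_EOF in the read callback, which is the point at which the old handle can be closed.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Try to send one byte without raising SIGPIPE.
 * Returns 0 on success, or the errno reported by send(). */
int probe_peer(int fd)
{
    const char byte = 0;
    if (send(fd, &byte, 1, MSG_NOSIGNAL) < 0)
        return errno;
    return 0;
}

/* Create a connected pair of Unix stream sockets, close one end to
 * simulate the daemon being killed, then probe the surviving end.
 * Returns the probe result, or -1 if the pair could not be created. */
int demo_peer_closed(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    close(sv[1]);                 /* the "listener daemon" goes away */
    int err = probe_peer(sv[0]);  /* the relay's next write */
    close(sv[0]);
    return err;
}
```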

Does a 'Broken pipe' exception cancel my job?

Currently I am running a Flink program on a remote cluster of 4 machines using 144 TaskSlots. After running for around 30 minutes I received the following error:
INFO org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet - Info server for jobmanager: Failed to write json updates for job b2eaff8539c8c9b696826e69fb40ca14, because org.eclipse.jetty.io.RuntimeIOException: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.io.UncheckedPrintWriter.setError(UncheckedPrintWriter.java:107)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:280)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:295)
    at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.writeJsonUpdatesForJob(JobManagerInfoServlet.java:588)
    at org.apache.flink.runtime.jobmanager.web.JobManagerInfoServlet.doGet(JobManagerInfoServlet.java:209)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:734)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:847)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:532)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:227)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:965)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:388)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:187)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:901)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:47)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:113)
    at org.eclipse.jetty.server.Server.handle(Server.java:352)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:1048)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:549)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:211)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:425)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:489)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:436)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:905)
    at org.eclipse.jetty.http.AbstractGenerator.flush(AbstractGenerator.java:427)
    at org.eclipse.jetty.server.HttpOutput.flush(HttpOutput.java:78)
    at org.eclipse.jetty.server.HttpConnection$Output.flush(HttpConnection.java:1139)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:159)
    at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:86)
    at java.io.ByteArrayOutputStream.writeTo(ByteArrayOutputStream.java:154)
    at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:258)
    at org.eclipse.jetty.server.HttpWriter.write(HttpWriter.java:107)
    at org.eclipse.jetty.io.UncheckedPrintWriter.write(UncheckedPrintWriter.java:271)
    ... 24 more
Caused by: java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:51)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:470)
    at org.eclipse.jetty.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:185)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:256)
    at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:849)
    ... 33 more
I know that java.io.IOException: Broken pipe means that the JobManager lost some kind of connection, so I guess the whole job failed and I have to restart it. Although I think the process is no longer running, the web interface still lists it as running. Additionally, the JobManager is still present when I use jps to identify the running processes on the cluster. So my question is whether my job is lost, and whether this error happens randomly or was caused by my program.
EDIT: My TaskManagers still send Heartbeats every few seconds and seem to be running.
It's actually a problem of the JobManagerInfoServlet, Flink's web server, which cannot send the latest JSON updates of the requested job to your browser because of the java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method). Thus, only the GET request to the server failed.
Such a failure should not affect the execution of the currently running Flink job. Simply refreshing your browser (with Flink's web UI) should send another GET request which then hopefully completes successfully.

How do I detect shutdown/reboot from linux app [duplicate]

This question already has answers here:
how to detect Linux shutdown/reboot
(2 answers)
Closed 3 years ago.
I have an application written in C which runs as a daemon and needs to send something through RS232 when system is in shutdown or reboot state, and it needs to distinguish between these two.
So my idea is:
In the "stop" case of my init script /etc/init.d/my_app, I will run the /sbin/runlevel command to get the current runlevel:
0 - shutdown state
6 - reboot state
Then I will execute some command to tell my daemon which state it is; the daemon will communicate over RS232 and then exit.
I think it should work, however it may not be the best solution, especially because my app is already running as a daemon, maybe I can receive some signal directly from system/kernel/library or through unix socket or something.
Best regards
Marek
I am not sure which signal is sent to an application on system shutdown. My best guess is SIGTERM, followed by SIGKILL if the application does not shut down. So did you try to catch SIGTERM and shut your program down properly? There are a lot of examples on the net of how to do that.
For more sophisticated process handling you can send SIGUSR1, SIGUSR2 to your application.
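The runlevel idea from the question could be sketched as below. This assumes the usual SysV output format of /sbin/runlevel, "<previous> <current>" (for example "N 6"), and the 0/6 mapping given in the question. The type and function names are hypothetical, and how the daemon obtains the text (pipe, command-line argument, and so on) is left open.

```c
#include <string.h>

typedef enum { SYS_UNKNOWN, SYS_HALT, SYS_REBOOT } sys_transition;

/* Classify the output of /sbin/runlevel, e.g. "N 6" or "3 0".
 * Only the character after the first space (the current runlevel)
 * is inspected. Anything else, including the literal "unknown",
 * yields SYS_UNKNOWN. */
sys_transition classify_runlevel(const char *out)
{
    if (out == NULL)
        return SYS_UNKNOWN;

    const char *sp = strchr(out, ' ');
    if (sp == NULL || sp[1] == '\0')
        return SYS_UNKNOWN;

    switch (sp[1]) {
    case '0': return SYS_HALT;    /* shutdown */
    case '6': return SYS_REBOOT;  /* reboot */
    default:  return SYS_UNKNOWN;
    }
}
```

The init script could pass the result to the daemon through SIGUSR1 or SIGUSR2, as suggested above, with one signal per state.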

Detecting network activity in C [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question appears to be off-topic because it lacks sufficient information to diagnose the problem. Describe your problem in more detail or include a minimal example in the question itself.
Closed 8 years ago.
If I lose connection to a server, I start an alarm to go off in 10 minutes. In the meantime I try to create a socket and re-establish a connection to the server. If when the alarm goes off, there is no connection to the server, I want to close the application.
What would be a good way to go about checking if there is a live connection on a socket? I am unsure if blocking methods are acceptable (obviously if there is no alternative they are).
If I lose connection to a server, I start an alarm to go off in 10 minutes.
So at that point you knew there was no connection.
In the meantime I try to create a socket and re-establish a connection to the server. If when the alarm goes off, there is no connection to the server, I want to close the application.
What would be a good way to go about checking if there is a live connection on a socket? I am unsure if blocking methods are acceptable (obviously if there is no alternative they are).
If you knew there was no connection when you set the alarm, why don't you know the same thing when it expires?
It seems to me that all you need to do is examine the socket fd. If it holds a valid descriptor you have a connection; if it holds your "no connection" value you don't. Use -1 for that value rather than zero, since 0 is itself a valid descriptor, and make sure you reset it when you set the alarm.
Just save the result of connect() somewhere, so you can check it in 10 minutes.
What would be a good way to go about checking if there is a live connection on a socket? I am unsure if blocking methods are acceptable (obviously if there is no alternative they are).
I assume from this question and the explanation above that you know how to handle a lost connection but don't know how to check whether the connection is still alive.
The best way to check whether the connection is still alive is to periodically send a dummy / heartbeat / keep-alive message to the server. Once the connection is dead, the TCP socket will give you an error (after the timeout), so you know that the connection died and you can try to reconnect, flag the alarm, etc.
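The bookkeeping suggested in these answers might look like the sketch below. All names are hypothetical. The state holds the socket descriptor, with -1 meaning "no connection", and the time the connection was lost. The caller is assumed to close the old descriptor itself and to call conn_mark_lost() whenever a send, receive, or heartbeat fails.

```c
#include <time.h>

#define RECONNECT_DEADLINE_SECS (10 * 60)  /* the 10-minute alarm */

typedef struct {
    int    fd;       /* -1 while disconnected */
    time_t lost_at;  /* when the connection was lost */
} conn_state;

/* Record that the connection is gone as of `now`. */
void conn_mark_lost(conn_state *c, time_t now)
{
    c->fd = -1;
    c->lost_at = now;
}

/* Record a successful reconnect on descriptor `fd`. */
void conn_mark_connected(conn_state *c, int fd)
{
    c->fd = fd;
}

/* Nonzero if still disconnected once the deadline has passed. */
int conn_should_exit(const conn_state *c, time_t now)
{
    return c->fd == -1 &&
           difftime(now, c->lost_at) >= RECONNECT_DEADLINE_SECS;
}
```

With this in place the alarm handler, or a periodic check in the main loop, only needs to call conn_should_exit() with the current time.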

Why server waits for a client after the client application has been put in STOPPED state?

This question is an extension to this previously asked question:
I implemented the solution given by jxh with following params:
SO_KEEPALIVE = Enabled
TCP_KEEPIDLE = 120 secs
TCP_KEEPINTVL = 75 secs
TCP_KEEPCNT = 1
Then why does the server still wait forever for the client to respond?
Also I found out on internet that
kill <pid> actually sends SIGTERM to the given process.
So I used ps -o pid,cmd,state command after 'killing' the telnet application.
I saw that the telnet process was still there but with process state = T, i.e. it was in STOPPED state
P.S.: I do not have much knowledge of Linux Signals, please consider that.
Because the client hasn't exited yet, being still in STOPPED state, and therefore hasn't closed its connections either.
Since the client process is still alive, the TCP stack in the kernel will answer the keep-alive packets it receives with an acknowledgement packet back to the sender. So, even though the connection is indeed idle, it will never be closed, since the kernel is happily processing the packets.
On a real network, given your parameters, the connection would be closed if the ACK from the client machine ever got lost. On your setup, since the client and server are on the same machine, your network will be essentially lossless.
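For reference, the parameters listed in the question map onto setsockopt() calls roughly as follows. This is a sketch for Linux, where the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options are available. The function name enable_keepalive is hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable TCP keep-alive on `fd`.
 *   idle_secs  : idle time before the first probe   (TCP_KEEPIDLE)
 *   intvl_secs : interval between probes            (TCP_KEEPINTVL)
 *   count      : unanswered probes before giving up (TCP_KEEPCNT)
 * Returns 0 on success, -1 if any setsockopt() call fails. */
int enable_keepalive(int fd, int idle_secs, int intvl_secs, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof idle_secs) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl_secs, sizeof intvl_secs) != 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &count, sizeof count) != 0)
        return -1;
    return 0;
}
```

With the question's values the call would be enable_keepalive(fd, 120, 75, 1). As explained above, the probes are acknowledged by the peer's kernel, so they detect a dead host or path, not a stopped process.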
It is unclear to me how you got your telnet sessions in this state. SIGTERM will not put a process in the stopped state. The process goes into stopped state when receiving SIGSTOP (and usually SIGTSTP, but it seems telnet ignores that one). I suggest that perhaps you sent that signal by mistake, or you suspended the session (with ^]z). When that happened, the window with your telnet session should have shown output like:
[1]+ Stopped telnet ...
This is printed by the shell. When the telnet process is stopped, it won't process the SIGTERM until after it is placed in the foreground.
A SIGKILL (done with kill -9 <pid>) will be processed immediately.
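The behaviour described in this answer can be observed with a small experiment. The sketch below (the function name is hypothetical) forks a child, stops it with SIGSTOP, sends SIGTERM while it is stopped, confirms that the child has not exited, and then resumes it with SIGCONT so that the pending SIGTERM is finally delivered.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the signal that terminated the child, or -1 if a step
 * did not behave as expected. */
int sigterm_while_stopped(void)
{
    int status;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (;;)
            pause();              /* child: idle until signalled */
    }

    kill(pid, SIGSTOP);           /* put the child in state T */
    if (waitpid(pid, &status, WUNTRACED) != pid || !WIFSTOPPED(status)) {
        kill(pid, SIGKILL);
        return -1;
    }

    kill(pid, SIGTERM);           /* stays pending while stopped */
    if (waitpid(pid, &status, WNOHANG) != 0) {
        kill(pid, SIGKILL);       /* child changed state unexpectedly */
        return -1;
    }

    kill(pid, SIGCONT);           /* resume: SIGTERM is now delivered */
    if (waitpid(pid, &status, 0) != pid || !WIFSIGNALED(status))
        return -1;
    return WTERMSIG(status);
}
```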
