At a high level, here is what is happening:
We have two SQL Server 2008 R2 SP1 systems (Standard Edition on Windows NT 6.1 (Build 7601: Service Pack 1))
They are humming along just fine, communicating bi-directionally with no errors or issues.
We reboot system #2, expecting that any Service Broker messages sent to it while it is unavailable will queue up on system #1, until system #2 comes back up.
System #2 comes back up and everything there starts normally with no errors.
The messages that queued up on system #1 for system #2 remain queued up; they are never sent. Furthermore, new messages on that conversation also queue up and are never sent.
Messages sent on new conversations are transmitted just fine.
Details about the messages that are never sent:
A. While system #2 is down, the transmission_status for the messages in the queue show various errors indicating that it cannot communicate with system #2, as expected.
B. Shortly after system #2 comes back up, the transmission_status for those messages goes blank. The blank status never changes after this point.
C. The conversation where messages stack up is in the CONVERSING/CO state. No columns in the system view indicate anything is any different from other queues that are working fine. (If I could find any flags set differently, I would know to terminate the bad conversation, but the system offers no clues--other than the ever-growing queue depth.)
D. The messages are never received on system #2, in the sense that my activation stored procedure is never called for these messages.
E. In Profiler (with all Broker trace types turned on), a good conversation shows these things being logged:
Broker:Conversation CONVERSING 1 - SEND Message Initiator
Broker:Message Classify 2 - Remote Initiator
[SQL Batch complete; SQL that caused the SEND to occur]
Broker:Remote Message Acknowledgement 1 - Message with Acknowledgement Sent Initiator
Broker:Message Classify 1 - Local Initiator
Broker:Conversation CONVERSING 6 - Received Sequenced Message Target
Broker:Remote Message Acknowledgement 3 - Message with Acknowledgement Received Initiator
Broker:Activation Microsoft SQL Server Service Broker Activation 1 - Start
A message being sent which is destined to get stuck shows only the first two of those events:
Broker:Conversation CONVERSING 1 - SEND Message Initiator
Broker:Message Classify 2 - Remote Initiator
As far as I can tell, that is as far as those messages ever get. There is no indication that SQL Server tries to transmit them ever again. System #1 thinks the conversation is still good, but System #2 has forgotten it completely. System #1 never seems to figure this out. If we subsequently reboot system #1, then everything is back to normal with all messages flowing as intended.
I have considered that these messages have actually been sent, but that the acknowledgement is not making it back to system #1. But I don’t see any evidence of backed up queues of acknowledgements.
We have checked for numerous typical issues on both sides:
1. Broker is enabled on both sides.
2. All queues are on, with all appropriate things enabled (enqueue, receive). Queues are not poisoned.
3. No permissions issues exist that we know of.
4. We are not using fire-and-forget.
5. We are reusing conversations, as various people recommend doing. (In fact, conversation re-use is the problem here!)
6. We are trapping SQL exceptions, using transactions as instructed, etc.
7. ssbdiagnose returns no errors.
When a SQL Server host is rebooted, we expect that any queued up messages will eventually get sent, but they are not. What is going on here??
I understand this is quite an old thread, but I have battled exactly the same situation before, and in my case the network configuration was the culprit.
For some reason, the initiator was sending its messages from one IP address, but a different IP had been opened to accept incoming replies (and that second IP was the one specified in the target's route).
I discovered this by accident, really. When I tried to end the conversation on the target side, it didn't close, but the EndDialog message appeared in sys.transmission_queue with this status:
Connection attempt failed with error: '10060(A connection attempt
failed because the connected party did not properly respond after a
period of time, or established connection failed because connected
host has failed to respond.)'.
I have no idea why the target restart triggered the breakdown, but once the network engineers fixed the issue and I changed the target's route, everything flowed to its destination as it was supposed to from the start.
I understand that, with the help of signals, we can pass interrupts to running C programs and direct them to behave according to the handlers assigned. When Ctrl+C is pressed, SIGINT is delivered.
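A minimal sketch of how such handlers are installed with sigaction(); the cleanup step is only a placeholder, not the actual DiscoveryServer code, and note that SIGINT and SIGHUP can be caught while SIGKILL cannot be caught at all:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal = 0;

/* Only async-signal-safe work belongs in a handler; just record the signal. */
static void on_signal(int signo)
{
    got_signal = signo;
}

int main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGINT, &sa, NULL);   /* delivered on Ctrl+C */
    sigaction(SIGHUP, &sa, NULL);   /* delivered when the controlling terminal goes away */

    while (!got_signal)
        pause();                    /* wait until a signal arrives */

    /* Placeholder: deregister from the local discovery server here. */
    printf("caught signal %d, cleaning up\n", (int)got_signal);
    return 0;
}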
Currently I am running a setup in which I have two systems. Both have servers (DiscoveryServers) capable of the multicast DNS extension (mDNS). I have made use of signals like SIGHUP and SIGKILL.
I am trying to understand what SIGKILL and SIGHUP actually do here.
So in one of the systems, I have an extra server which registers to the local discovery server. This server is advertised throughout the network with the help of mDNS.
Now, through SIGHUP, I can see that closing the server's terminal is detected and the corresponding handler is executed.
But when I shut the system down, both the server and the discovery server of that system should go down. Yet that is detected only inconsistently (not always) through SIGHUP. I tried with SIGKILL and the response is still the same. It is very unclear why this is happening.
Is it because the mDNS is using UDP and the UDP is unreliable?
E.g.: Consider A and B as two systems. Both A and B have discovery servers running (capable of mDNS, meaning they can advertise the server records throughout the network) which contain the list of servers running on their respective systems. An extra server (server E) running on A registers with the discovery server of A. Now the registered server records are also advertised. Because of mDNS, the discovery server on B also updates its cache with all the advertisements from system A. B can be seen as a client that runs a particular API to get the list of servers running on the network. All of this works fine.
If I close the terminal of the extra server on purpose, the SIGHUP handler works smoothly. I can see from the logs of both discovery servers that the extra server has been removed.
Now if I shut down system A unexpectedly, the terminals running the server applications should close. That can be achieved with the SIGHUP handler. What I observe from the log of system B is that sometimes there is a clean removal of the servers, and sometimes only a partial one.
There is no clarity as to why this is happening.
When a client connects to my server side and then, after connecting, switches to a VPN or something, the server side still says the socket is alive and still tries to read from it. I tried using another thread to check all my sockets constantly with read() and close any socket where it returns -1, but it still doesn't do anything.
It depends very much on what type of protocol you use, but the generalized answer is: yes and no. You have to learn the network protocol stack to know what you can do in your situation, the details of which you did not disclose.
The usual way to solve this problem is to establish some policy of two-way communication. E.g., if there has been no data or "I'm alive" message sent from client X for duration Y, we close the connection. Or, send a regular "ping" message to client C and expect a response before period Y expires.
If we're talking TCP, and if the client's connection is properly closed, a message is sent to the server, so the server will know the connection is closed, so read/recv will return 0 bytes indicating EOF.
But you're asking about the times when the client becomes unable to communicate with the server. Detecting an absence of messages is necessarily done using a timeout.
You can have the server "ping" the client (send a message to which the client must respond) periodically.
You can have the client send a message periodically (a "heartbeat") when idle.
Either way, no message (of any kind) for X seconds indicates a broken connection.
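A minimal sketch of that timeout idea, assuming a single connected TCP socket fd and plain poll(): if nothing at all arrives within the window, treat the connection as broken.

#include <poll.h>
#include <unistd.h>

#define IDLE_TIMEOUT_MS (30 * 1000)   /* the "X seconds" from above */

/* Returns 0 if data (or EOF) is ready to read, -1 if the peer went silent
 * for the whole window and the socket has been closed. */
static int wait_or_drop(int fd)
{
    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;

    int rc = poll(&pfd, 1, IDLE_TIMEOUT_MS);
    if (rc == 0) {            /* no data, no heartbeat, no ping reply */
        close(fd);            /* assume the connection is broken */
        return -1;
    }
    return (rc < 0) ? -1 : 0;
}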
If you enable the SO_KEEPALIVE socket option on each new TCP connection, the OS will automatically ping the remote side periodically to see if it still responds, and close the connection if it doesn't. The default timeout is several hours, but many OSes allow you to configure a lower timeout on a per-socket basis. Unfortunately, each one is different in how to do this. Linux, for example, uses the TCP_KEEPIDLE socket option. NetBSD (And probably other BSDs) uses TCP_KEEPALIVE. And so on.
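For example, on Linux the per-socket tuning might look like the sketch below; the TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT names are Linux-specific, as noted above.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int enable_keepalive(int fd)
{
    int on = 1, idle = 60, intvl = 10, cnt = 3;

    /* Turn TCP keepalive on for this connection. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    /* Start probing after 60 s of idle time... */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    /* ...probe every 10 s... */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    /* ...and drop the connection after 3 unanswered probes. */
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}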
What I'm doing
I'm implementing a websocket server on a stellaris board as the title says. At the moment I'm able to establish connection to the client and send a few frames.
The way I'm implementing the websocket
The way I'm developing it is something like a master-slave communication. Whenever the client sends a string, the server decodes it and then answers. At the moment I'm simply responding to the character 'e', which is designed to be just a counter. The thing is that I implemented the websocket on the client side to send 'e' whenever it receives a message and then display the message on the page.
The problem
The problem is that it does about 15 transactions, then I can see the communication being re-transmitted from and to the stellaris board, and then the communication closes. After the connection closes I noticed that I can't access any other page on the board. It simply doesn't respond anymore.
My assumptions of what may be causing it
This led me to believe that the transactions are happening too fast and there may be an implementation bug, an lwIP bug, or a hardware bug (I'm using the enet_io example as a base).
My assumptions on how to fix it
After seeing this, I imagine that what I need is to throttle the string being sent to the microcontroller so that it is sent once a second, or maybe even less often, because at the moment it was doing something like 1000 transactions per second, and sometimes more.
The question
So ... after my trials I still have a few questions that need to be answered. Do websockets need this kind of relationship? Where client asks and server serves? Or can I simply stream data from the server to the client as long as the connection is open? Is my supposition that slowing down my rates will work?
Do websockets need this kind of relationship [request-response]? Where client asks and server serves? Or can I simply stream data from the server to the client as long as the connection is open?
The Websocket protocol doesn't require a request-response model (except for the connection establishing handshake).
The server can stream data to the client without worrying about any response or request from the client.
However, it's common practice to get a response or a ping from a client once in a while, just to know they're alive.
This allows the client to renew a connection if a message or ping fails to reach the server - otherwise the client might not notice an abnormally dropped connection (it will just assume no updates are being sent because there's no new data).
It also allows the server to know a connection is still alive even when no information is being exchanged.
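As an illustration, a minimal sketch of pushing frames from the server without waiting for a request, assuming a plain BSD socket on which the WebSocket handshake has already completed (the lwIP raw API used by the enet_io example looks different, and only payloads under 126 bytes are handled here):

#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Server-to-client frames are not masked (RFC 6455). Opcode 0x1 = text, 0x9 = ping. */
static int ws_send_small_frame(int sock, unsigned char opcode,
                               const char *payload, size_t len)
{
    unsigned char hdr[2];
    if (len >= 126)
        return -1;                       /* extended payload lengths not handled here */
    hdr[0] = 0x80 | opcode;              /* FIN = 1 */
    hdr[1] = (unsigned char)len;         /* mask bit left clear */
    if (send(sock, hdr, 2, 0) != 2)
        return -1;
    if (len && send(sock, payload, len, 0) != (ssize_t)len)
        return -1;
    return 0;
}

/* Usage: stream an update, then an occasional ping to confirm the client is alive. */
/*   ws_send_small_frame(sock, 0x1, "counter=42", 10);   text frame  */
/*   ws_send_small_frame(sock, 0x9, "", 0);              ping frame  */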
Is my supposition that slowing down my rates will work?
I guess this question becomes less relevant due to the first question's answer... however, I should probably note that the web socket client (often a browser) will have limited resources and a different memory management scheme.
Browsers are easy to overwhelm with too much data because they often keep references to all the exchanges since the page was loaded (or refreshed).
This is especially true when logging events to a browser's console.
I am using IBM WebSphere MQ to set up a durable subscription for Pub/Sub. I am using their C APIs. I have set up a subscription name and have MQSO_RESUME in my options.
When I set a wait interval for my subscriber and I properly close my subscriber, it works fine and restarts fine.
But if I force-crash my subscriber (Ctrl-C) and try to reopen it, I get MQSUB ended with reason code 2429, which is MQRC_SUBSCRIPTION_IN_USE.
I use MQWI_UNLIMITED as my WaitInterval in my MQGET and use MQGMO_WAIT | MQGMO_NO_SYNCPOINT | MQGMO_CONVERT as my MQGET options
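A rough sketch of the kind of MQSUB call being described, with placeholder topic and subscription names (this is not the original code; hConn is assumed to come from an earlier MQCONN):

#include <string.h>
#include <cmqc.h>

static MQHOBJ resume_subscription(MQHCONN hConn, MQHOBJ *pHobj)
{
    MQSD   sd   = {MQSD_DEFAULT};
    MQHOBJ hSub = MQHO_UNUSABLE_HOBJ;
    MQLONG compCode, reason;

    sd.Options = MQSO_CREATE | MQSO_RESUME          /* create, or resume if it already exists */
               | MQSO_DURABLE | MQSO_MANAGED
               | MQSO_FAIL_IF_QUIESCING;
    sd.ObjectString.VSPtr    = "MY/TOPIC";
    sd.ObjectString.VSLength = (MQLONG)strlen("MY/TOPIC");
    sd.SubName.VSPtr         = "MY.DURABLE.SUB";
    sd.SubName.VSLength      = (MQLONG)strlen("MY.DURABLE.SUB");

    *pHobj = MQHO_NONE;                             /* let MQ manage the destination queue */
    MQSUB(hConn, &sd, pHobj, &hSub, &compCode, &reason);
    /* Reason 2429 (MQRC_SUBSCRIPTION_IN_USE) here means the queue manager still
     * believes the previous, crashed instance holds this subscription. */
    return hSub;
}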
This error pops up only when the topic has no pending messages for that subscription. If it has pending messages that the subscription can resume, then it resumes, and it ignores the first published message in that topic.
I tried changing the heartbeat interval to 2 seconds and that didn't fix it.
How do I prevent this?
This happens because the queue manager has not yet detected that your application has lost its connection to the queue manager. You can see this by issuing the following MQSC command:-
DISPLAY CONN(*) TYPE(ALL) ALL WHERE(APPLTYPE EQ USER)
and you will see your application still listed as connected. As soon as the queue manager notices that your process has gone you will be able to resume the subscription again. You don't say whether your connection is a locally bound connection or a client connection, but there are some tricks to help speed up the detection of connections depending on the type of connection.
You say that in the cases when you are able to resume, you don't get the first message. This is because you are retrieving these messages with MQGMO_NO_SYNCPOINT, so the message you are not getting had already been removed from the queue and was on its way down the socket to the client application at the time you forcibly crashed it, and so that message is gone. If you use MQGMO_SYNCPOINT (and MQCMIT) you will not have that issue.
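A minimal sketch of that change, assuming hConn and hObj come from the earlier MQCONN/MQSUB calls: get under syncpoint, process, and only then commit, so a crash before MQCMIT puts the message back rather than losing it.

#include <cmqc.h>

static void get_one_message(MQHCONN hConn, MQHOBJ hObj, char *buf, MQLONG bufLen)
{
    MQMD   md  = {MQMD_DEFAULT};
    MQGMO  gmo = {MQGMO_DEFAULT};
    MQLONG dataLen, compCode, reason;

    gmo.Options      = MQGMO_WAIT | MQGMO_SYNCPOINT | MQGMO_CONVERT;
    gmo.WaitInterval = MQWI_UNLIMITED;

    MQGET(hConn, hObj, &md, &gmo, bufLen, buf, &dataLen, &compCode, &reason);
    if (compCode == MQCC_OK) {
        /* ...process the message here... */
        MQCMIT(hConn, &compCode, &reason);   /* the message is only removed now */
    }
}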
You say that you don't see the problem when there are still messages on the queue to be processed, only when the queue is empty. I suspect the difference here is whether your application is in an MQGET wait or is processing a message when you forcibly crash it. Clearly, when there are no messages left on the queue, you are guaranteed, given the use of MQWI_UNLIMITED, to be in the MQGET wait, but when processing messages you probably spend more time out of the MQGET than in it.
You mention tuning down the heartbeat interval to try to reduce this time frame, which was a good idea. You said it didn't work. Remember that you have to change it at both ends of the channel, or you will still be using the default of 5 minutes.
When calling SQL Server from a client using ODBC, if a long-running query exceeds the time specified in SQL_ATTR_QUERY_TIMEOUT, I see that control is returned to the application. My question is: does the work continue within the SQL Server engine? If it does continue, what can be done to abort/cancel/stop the request on the server? Is there a best-practice consideration to keep in mind?
The client sends an Attention signal to the server:
The client can interrupt and cancel the current request by sending an Attention message. This is also known as out-of-band data, but any TDS packet that is currently being sent MUST be finished before sending the Attention message. After the client sends an Attention message, the client MUST read until it receives an Attention acknowledgment.
The engine will abort the batch at the first opportunity (for all practical purposes, right away) and send back the Attention ack. In certain states a batch cannot be interrupted, e.g. while rolling back a transaction. In such a case the client may request an abort, but the response will come only after the server finishes the non-interruptible work.
The above is true for any SQL Server client stack; e.g., this is exactly how SqlCommand.CommandTimeout works.
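For example, with the ODBC C API the timeout is set per statement, and SQLCancel() sends the same Attention explicitly if you prefer to control it yourself from another thread (the procedure name below is only a placeholder):

#include <sql.h>
#include <sqlext.h>

SQLRETURN run_with_timeout(SQLHSTMT hStmt)
{
    /* Give up on (and cancel at the server) anything running longer than 30 s. */
    SQLSetStmtAttr(hStmt, SQL_ATTR_QUERY_TIMEOUT, (SQLPOINTER)(SQLULEN)30, 0);

    SQLRETURN rc = SQLExecDirect(hStmt, (SQLCHAR *)"EXEC dbo.LongRunningProc", SQL_NTS);
    if (rc == SQL_ERROR) {
        /* SQLSTATE HYT00 here means the timeout expired: the driver has already
         * sent the Attention and the server has aborted the batch. */
    }
    return rc;
}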
The KILL command works in a very similar manner, except that it is not client-to-server communication but killing-SPID -> victim-SPID communication.