I am using IBM WebSphere MQ to set up a durable subscription for Pub/Sub. I am using their C APIs. I have set up a subscription name and have MQSO_RESUME in my options.
When I set a wait interval for my subscriber and I properly close my subscriber, it works fine and restarts fine.
But if I force-crash my subscriber (Ctrl-C) and then try to reopen it, MQSUB fails with reason code 2429, which is MQRC_SUBSCRIPTION_IN_USE.
I use MQWI_UNLIMITED as the WaitInterval in my MQGET, and MQGMO_WAIT | MQGMO_NO_SYNCPOINT | MQGMO_CONVERT as my MQGET options.
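For reference, the setup looks roughly like this (a trimmed sketch; the topic string, subscription name and buffer size are placeholders, and MQSO_MANAGED is just my assumption of a managed destination queue):

#include <cmqc.h>

MQHCONN hConn;                     /* obtained earlier from MQCONN (not shown) */
MQSD    sd   = {MQSD_DEFAULT};
MQMD    md   = {MQMD_DEFAULT};
MQGMO   gmo  = {MQGMO_DEFAULT};
MQHOBJ  hObj = MQHO_NONE;          /* managed destination queue                */
MQHOBJ  hSub = MQHO_NONE;          /* subscription handle                      */
MQLONG  compCode, reason, dataLength;
char    buffer[4096];

/* Durable subscription, created the first time and resumed by name thereafter */
sd.Options = MQSO_CREATE | MQSO_RESUME | MQSO_DURABLE | MQSO_MANAGED;
sd.ObjectString.VSPtr    = "SOME/TOPIC";
sd.ObjectString.VSLength = MQVS_NULL_TERMINATED;
sd.SubName.VSPtr         = "MY.SUB.NAME";
sd.SubName.VSLength      = MQVS_NULL_TERMINATED;

MQSUB(hConn, &sd, &hObj, &hSub, &compCode, &reason);

/* Wait forever for the next publication */
gmo.Options      = MQGMO_WAIT | MQGMO_NO_SYNCPOINT | MQGMO_CONVERT;
gmo.WaitInterval = MQWI_UNLIMITED;

MQGET(hConn, hObj, &md, &gmo, sizeof(buffer), buffer,
      &dataLength, &compCode, &reason);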
This error pops up only when the topic has no pending messages for that subscription. If there are pending messages that the subscription can resume, then it resumes, but it misses the first message published on that topic.
I tried changing the heartbeat interval to 2 seconds and that didn't fix it.
How do I prevent this?
This happens because the queue manager has not yet detected that your application has lost its connection to the queue manager. You can see this by issuing the following MQSC command:
DISPLAY CONN(*) TYPE(ALL) ALL WHERE(APPLTYPE EQ USER)
and you will see your application still listed as connected. As soon as the queue manager notices that your process has gone, you will be able to resume the subscription again. You don't say whether your connection is a locally bound connection or a client connection, but there are some tricks to help speed up the detection of a broken connection, depending on which type it is.
You say that in the cases when you are able to resume, you don't get the first message. This is because you are retrieving these messages with MQGMO_NO_SYNCPOINT, so the message you are not getting had already been removed from the queue and was on its way down the socket to the client application at the time you forcibly crashed it; that message is gone. If you use MQGMO_SYNCPOINT (and MQCMIT), you will not have that issue.
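If it helps, here is a minimal sketch of the syncpoint variant (assuming hConn and hObj are your existing connection handle and the queue handle returned by MQSUB):

MQMD   md  = {MQMD_DEFAULT};
MQGMO  gmo = {MQGMO_DEFAULT};
MQLONG compCode, reason, dataLength;
char   buffer[4096];

gmo.Options      = MQGMO_WAIT | MQGMO_SYNCPOINT | MQGMO_CONVERT;
gmo.WaitInterval = MQWI_UNLIMITED;

MQGET(hConn, hObj, &md, &gmo, sizeof(buffer), buffer,
      &dataLength, &compCode, &reason);

if (compCode == MQCC_OK)
{
    /* ... process the publication ... */

    /* Commit only once the message is safely processed. If the application */
    /* dies before this point, the get is rolled back and the message is    */
    /* redelivered when the subscription is resumed.                        */
    MQCMIT(hConn, &compCode, &reason);
}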
You say that you don't see the problem when there are still messages on the queue to be processed, only when the queue is empty. I suspect the difference here is whether your application is in an MQGET-wait or processing a message when you forcibly crash it. Clearly, when there are no messages left on the queue, you are guaranteed, with the use of MQWI_UNLIMITED, to be in the MQGET-wait, but when processing messages you probably spend more time out of the MQGET than in it.
You mention tuning down the heartbeat interval to try to reduce the time frame; that was a good idea, but you said it didn't work. Remember that the heartbeat interval (HBINT) has to be changed at both ends of the channel, because the value actually used is negotiated as the larger of the two; otherwise you will still be using the default of 5 minutes.
I've got a C client listening to Tibco RV (using 8.4.0). The source pumps out messages on PREFIX1.* and PREFIX2.* pretty frequently (can be several times per second).
I have six threads, each listening to a particular SUFFIX, e.g. PREFIX1.SUFFIX_A and PREFIX2.SUFFIX_A. So each thread has a listener and its own queue for those two subjects. I've got a queue size limit of 1000, dropping the oldest 200 if we hit that (but we never have more than about 40 in the queue at busy times).
After running fine for many hours, the program suddenly stops receiving data each day. The source continues to publish, but I no longer dispatch events from any queue. I don't understand what could have caused this (aside from deleting the listeners).
What might have caused the listening to stop? Or alternatively, given that the system is high frequency, how can this be investigated? Can I tell whether a listener is still active via the C interface? I couldn't see anything in the API for that.
Thanks for any help,
-Dave
It looks like the problem was that the machine had only a partial install of RV; in particular, there was no rvd daemon in the package we had for that machine. Having re-read the docs I'm actually a bit confused about how we managed to get any network data at all, but it seems that without a daemon we get networking only until the first minor network problem, and then nothing; with the daemon we recover from network errors.
So the fix for this case was simply to install the full package and ensure the daemon runs constantly. Now the problem appears to have disappeared.
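For reference, this is roughly how the transport gets pointed at the daemon (a sketch; the service, network and daemon strings are placeholders for whatever your site uses):

#include <tibrv/tibrv.h>

tibrv_status   status;
tibrvTransport transport;

status = tibrv_Open();                      /* initialise the RV machinery */
if (status != TIBRV_OK) { /* handle/log the error */ }

/* Point explicitly at the local rvd; if no daemon is reachable (and none  */
/* can be started), transport creation returns an error you can check,    */
/* rather than the client silently limping along without recovery.        */
status = tibrvTransport_Create(&transport,
                               "7500",           /* service (placeholder) */
                               ";239.255.1.1",   /* network (placeholder) */
                               "tcp:7500");      /* daemon  (placeholder) */
if (status != TIBRV_OK) { /* handle/log the error */ }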
I need an architecture for a single server reliably servicing multiple clients, with clients responding to an unresponsive server similar to the lazy pirate pattern from the 0MQ guide (i.e., they use zmq_poll to poll for replies; if the timeout elapses, they disconnect and reconnect the client socket and resend the request).
I took the "lazy pirate pattern" as a starting point, from the ZMQ C language examples directory (lpclient.c and lpserver.c). Removed the simulated failure stuff from lpserver.c so that it would run normally without simulating crashes, as follows:
Server has a simple loop (sketched in code below):
Read next message from the socket
Do some simulated work (1 second sleep)
Reply that it has serviced the request
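In code, the stripped-down server is essentially this (a sketch rather than the literal lpserver.c; the endpoint and buffer size are arbitrary):

#include <zmq.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *ctx    = zmq_ctx_new();
    void *server = zmq_socket(ctx, ZMQ_REP);
    zmq_bind(server, "tcp://*:5555");

    while (1) {
        char request[256];
        int  size = zmq_recv(server, request, sizeof(request) - 1, 0);
        if (size == -1)
            break;                               /* interrupted / context terminated */
        if (size > (int) sizeof(request) - 1)
            size = sizeof(request) - 1;          /* truncate oversized messages */
        request[size] = '\0';

        sleep(1);                                /* simulated one second of work */
        zmq_send(server, request, strlen(request), 0);   /* echo the request back */
    }

    zmq_close(server);
    zmq_ctx_destroy(ctx);
    return 0;
}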
Client has a simple loop (sketched in code below):
Send request to server
Run zmq_poll to check for response with some set timeout value
If timeout has elapsed, disconnect and reconnect to reset the connection and resend request at start of next iteration of loop
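And the client, reduced to the retry mechanics (again a sketch; the 2.5 second timeout is the value I used, everything else is illustrative):

#include <zmq.h>
#include <stdio.h>

#define TIMEOUT_MS 2500

static void *connect_socket(void *ctx)
{
    void *client = zmq_socket(ctx, ZMQ_REQ);
    int   linger = 0;
    zmq_setsockopt(client, ZMQ_LINGER, &linger, sizeof(linger));  /* don't block on close */
    zmq_connect(client, "tcp://localhost:5555");
    return client;
}

int main(void)
{
    void *ctx    = zmq_ctx_new();
    void *client = connect_socket(ctx);

    while (1) {
        zmq_send(client, "request", 7, 0);

        zmq_pollitem_t items[] = { { client, 0, ZMQ_POLLIN, 0 } };
        zmq_poll(items, 1, TIMEOUT_MS);

        if (items[0].revents & ZMQ_POLLIN) {
            char reply[256];
            int  size = zmq_recv(client, reply, sizeof(reply) - 1, 0);
            if (size >= 0) {
                if (size > (int) sizeof(reply) - 1)
                    size = sizeof(reply) - 1;
                reply[size] = '\0';
                printf("got reply: %s\n", reply);
            }
        } else {
            /* Timed out: a REQ socket that missed its reply is stuck, so throw */
            /* it away, reconnect, and resend at the top of the next iteration. */
            zmq_close(client);
            client = connect_socket(ctx);
        }
    }
}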
This worked great for one or two clients. I then tried to service 20 clients by running them like:
$ ./lpserver &
$ for i in {1..20}
do
./lpclient &
done
The behaviour I get is:
Clients all send their requests and begin polling for replies.
Server does one second of work on the first message it gets, then replies
First client gets its response back and sends a new request
Server does one second of work on the second message it gets, then replies
Second client gets its response back and sends a new request
Server receives the third client's request, but the third client times out before the work completes (the timeout is 2.5 seconds and the server's work period is 1 second, so from the third request onwards clients start dropping out).
Multiple clients (the fourth through the Nth) time out and resend their requests.
Server keeps pulling the now-defunct requests off the incoming message queue and doing the work for them, which ties up the server; with all of the defunct messages it takes 20 seconds to get through each round of the queue, so eventually every client times out.
Eventually all clients are dead and the server is still spitting out responses to defunct connections. This is terrible because the server keeps responding to requests the client has given up on (and therefore shouldn't expect the work to have been done), and spending all this time servicing dead requests guarantees that all future client requests will time out.
This example was presented as a way to handle multiple clients and a single server, but it simply doesn't work (I mean, if you did very quick work and had a long timeout, you would have some illusion of reliability, but it's pretty easy to envision this catastrophic collapse rearing its head under this design).
So what's a good alternative? Yes, I could shorten the time required to do work (spinning off worker threads if needed) and increase the timeout period, but this doesn't really address the core shortcoming - just reduces its likelihood - which isn't a solution.
I just need a simple request/reply pattern that handles multiple clients and a single server that processes requests serially, in the order they're received, but in which clients can time out reliably when the server is taking too long, and the server doesn't waste resources responding to defunct requests.
I am testing my WPF application connecting to Azure Blob Storage to download a bunch of images using TPL (tasks).
It is expected that in the live environment, internet connectivity at the deployed locations will be highly transient.
I have set Retry Policy and time-out in BlobRequestOptions as below:
//Note the values here are for test purposes only
//CloudRetryPolicy is a custom method returning adequate Retry Policy
// i.e. retry 3 times, wait 2 seconds between retries
blobClient.RetryPolicy = CloudRetryPolicy(3, new TimeSpan(0, 0, 2));
BlobRequestOptions bro = new BlobRequestOptions() { Timeout = TimeSpan.FromSeconds(20) };
blob.DownloadToFile(LocalPath, bro);
The above statements are in a background task that works as expected, and I have appropriate exception handling in both the background task and the continuation task.
In order to test exception handling and my recovery code, I am simulating internet disconnection by pulling out the network cable. I have hooked up a method to System.Net.NetworkChange.NetworkAvailabilityChanged event on UI thread and I can detect connection/disconnection as expected and update UI accordingly.
My problem is: if I pull the network cable while a file is being downloaded (via blob.DownloadToFile), the background thread just hangs. It does not time out, does not crash, does not throw an exception, nothing! As I write this, I have been waiting ~30 minutes and no response/processing has happened in relation to the background task.
If I pull the network cable before the download starts, execution is as expected, i.e. I can see retries happening, exceptions being raised and passed along, and so on.
Has anyone experienced similar behaviour? Any tips/suggestions to overcome this behaviour/problem?
By the way, I am aware that I can cancel the download task on detecting the loss of network connectivity, but I do not want to do this, as connectivity can be restored within the time-out duration and the download can then continue from where it was interrupted. I have tested this auto-resumption and it works nicely.
Below is a rough indication of my code structure (not syntactically correct, just a flow indication)
btnClick()
{
declare background_task
attach continuewith_task to background task
start background task
}
background_task()
{
try
{
... connection setup ...
blob.DownloadToFile(LocalPath, bro);
}
catch(exception ex)
{
... exception handling ....
// in case of connectivity loss while download is in progress
// this block is not getting executed
// debugger just sits idle without a current statement
}
}
continuewith_task()
{
check if antecedent task is faulted
{
... do recovery work ...
// this is working as expected if connectivity is lost
// before download starts
// this task does not get called if connectivity is lost
// while file transfer is taking place
}
else
{
.. further processing ...
}
}
Avkash is correct, I believe. Also, to be clear, you will basically never see that network-removed error, so there is not a lot of point in testing for it. You will see a ton of connection rejections, conflicts, missing resources, read-only accounts, throttling, access denied, and even DNS resolution failures, depending on how you are handling storage accounts. You should test for those.
That being said, I would suggest you not use the RetryPolicy at all with blob or table storage. Most of the errors you will actually encounter are not retryable to begin with (e.g. 404, 409, 403, etc.), and when you have a retry policy in place it will by default retry 4 more times over the next 2 minutes. There is no point in retrying bad credentials, for instance.
You are far better off simply handling the error and retrying selectively yourself (timeouts and throttling are about the only cases where a retry makes sense).
Your problem is mainly caused by the fact that the Azure storage client library uses file/network streaming classes underneath, which is why the hang is not directly related to the Windows Azure Blob client library itself. If you call the streaming APIs directly over the network you will see exactly the same behaviour when the network cable is suddenly removed, whereas disconnecting the network gracefully produces different behaviour.
If you search the internet you will find that these streaming classes do not detect the network loss, which is why, in your code, you should watch for the network-disconnect event yourself and then stop the background streaming thread.
At a high level, here is what is happening:
We have two SQL Server 2008 R2 SP1 systems (Standard Edition on Windows NT 6.1 (Build 7601: Service Pack 1))
They are humming along just fine, communicating bi-directionally with no errors or issues.
We reboot system #2, expecting that any Service Broker messages sent to it while it is unavailable will queue up on system #1, until system #2 comes back up.
System #2 comes back up and everything there starts normally with no errors.
The messages that queued up on system #1 for system #2 remain queued up; they are never sent. Furthermore, new messages on that conversation also queue up and are never sent.
Messages sent on new conversations are transmitted just fine.
Details about the messages that are never sent:
A. While system #2 is down, the transmission_status for the messages in the queue shows various errors indicating that it cannot communicate with system #2, as expected.
B. Shortly after system #2 comes back up, the transmission_status for those messages goes blank. The blank status never changes after this point.
C. The conversation where messages stack up is in the CONVERSING/CO state. No columns in the system view indicate anything is any different from other queues that are working fine. (If I could find any flags set differently, I would know to terminate the bad conversation, but the system offers no clues, other than the ever-growing queue depth.)
D. The messages are never received on system #2, in the sense that my activation stored procedure is never called for these messages.
E. In Profiler (with all Broker trace types turned on), a good conversation shows these things being logged:
Broker:Conversation CONVERSING 1 - SEND Message Initiator
Broker:Message Classify 2 - Remote Initiator
[SQL Batch complete; SQL that caused the SEND to occur]
Broker:Remote Message Acknowledgement 1 - Message with Acknowledgement Sent Initiator
Broker:Message Classify 1 - Local Initiator
Broker:Conversation CONVERSING 6 - Received Sequenced Message Target
Broker:Remote Message Acknowledgement 3 - Message with Acknowledgement Received Initiator
Broker:Activation Microsoft SQL Server Service Broker Activation 1 - Start
A message being sent which is destined to get stuck shows only the first two of those events:
Broker:Conversation CONVERSING 1 - SEND Message Initiator
Broker:Message Classify 2 - Remote Initiator
As far as I can tell, this is as far as those messages get. There is no indication that SQL Server ever tries to transmit them again. System #1 thinks the conversation is still good, but system #2 has forgotten it completely, and system #1 never seems to figure this out. If we subsequently reboot system #1, then everything is back to normal, with all messages flowing as intended.
I have considered that these messages have actually been sent, but that the acknowledgement is not making it back to system #1. But I don’t see any evidence of backed up queues of acknowledgements.
We have checked for numerous typical issues on both sides:
1. Broker is enabled on both sides.
2. All queues are on, with all appropriate things enabled (enqueue, receive). Queues are not poisoned.
3. No permissions issues exist that we know of.
4. We are not using fire-and-forget.
5. We are reusing conversations, as various people recommend doing. (In fact, conversation re-use is the problem here!)
6. We are trapping SQL exceptions, using transactions as instructed, etc.
7. ssbdiagnose returns no errors.
When a SQL Server host is rebooted, we expect that any queued up messages will eventually get sent, but they are not. What is going on here??
I understand this is quite an old thread, but I have fought exactly the same situation before, and in my case the network configuration was the culprit.
For some reason, the initiator sent its messages from one IP address, but a different IP was open to accept incoming replies (and this second IP was the one specified in the target's route).
I detected this by accident, really. When I tried to end the conversation on the target side, it didn't close; instead, the EndDialog message appeared in sys.transmission_queue with this status:
Connection attempt failed with error: '10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)'.
I have no idea why the target's restart triggered the breakdown, but once the network engineers had fixed the issue and I had changed the target's route, everything flowed to its destination as it was supposed to from the start.
Picture the scenario: I have an ADAPTER written in C that writes messages to SAP (by calling an RFC).
The adapter is only called when a new message arrives in the "engine", so it can have periods of no activity of up to a day or more. This is where the problem comes in: the connection handle becomes invalidated, either at the "low level socket" layer in the standard code or by some SAP parameters on the SAP side that say "kill handles that are no longer active".
So what I do now is SPAWN a thread that "sits on top of the adapter" and PINGS SAP every 10 seconds or so. The issue here is that I am using the SAME connection handle for sending messages to SAP as well as for the PING / HEARTBEAT message.
SAP says for the RFC handle:
"AN RFC handle can be used in several threads, but can only be active in one thread at a time. AN RFC handle of an RFC connection, created by one thread can be used in
another thread, but these threads have to synchronize the access to this handle."
Now, I have tried using pthread_mutex_lock etc. to make this work, but it does not.
I have one GLOBAL "handle", and when my adapter SHARED LIB starts up I launch a thread as follows:
rc = pthread_create(&heartbeatThread, NULL, heartbeatThreadMainLoop, (void *)NULL);
And this thread just PINGS SAP every 10 or so seconds.
In a perfect world I would like the MESSAGING to SAP to take priority here, so the PING should totally wait until it is "quiet" and then start up again.
I have looked at links like:
http://www.yolinux.com/TUTORIALS/LinuxTutorialPosixThreads.html#SYNCHRONIZATION
But I actually want to LOCK / UNLOCK a whole section of code, so as I said, if a MESSAGE is going into SAP the PING thread must wait... but if the PING thread is busy I would like to somehow INTERRUPT it and say "hey, I need that connection handle for messaging"...
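To make it concrete, this is roughly the shape I have been experimenting with (a simplified sketch; sap_send_message and sap_ping stand in for the real RFC calls, and it still gives me no way to interrupt the ping once it is inside the RFC call):

#include <pthread.h>
#include <unistd.h>

/* Placeholders for the real RFC calls made through the shared handle. */
extern void sap_send_message(void *handle, const void *msg);
extern void sap_ping(void *handle);

static void           *rfcHandle;   /* the one global handle, set up at connect time */
static pthread_mutex_t handleLock = PTHREAD_MUTEX_INITIALIZER;

/* Adapter path: real messages always wait their turn for the handle. */
void adapter_send(const void *msg)
{
    pthread_mutex_lock(&handleLock);
    sap_send_message(rfcHandle, msg);
    pthread_mutex_unlock(&handleLock);
}

/* Heartbeat thread: only ping when the handle is idle right now; if the */
/* adapter owns it, skip this round instead of queueing up behind it.    */
void *heartbeatThreadMainLoop(void *arg)
{
    (void) arg;
    for (;;) {
        sleep(10);
        if (pthread_mutex_trylock(&handleLock) == 0) {
            sap_ping(rfcHandle);
            pthread_mutex_unlock(&handleLock);
        }
    }
    return NULL;
}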
What is a best practice "pattern" for this?
Any help would be hugely appreciated.
Thanks
Lynton
The whole architecture can be simplified by increasing the scope of ADAPTER.
Instead of having its main loop wait for a request indefinitely, have it time out after 10 seconds. If it times out, do the wake-up (ping) logic. In either case (request or timeout), reset the timer.
That avoids the whole problem of sharing, and makes ADAPTER responsible for all interaction with SAP.
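A sketch of that shape, assuming the engine hands requests to the adapter through a condition-variable-protected queue (dequeue_request, sap_send_message and sap_ping are placeholders for the real internals):

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Stand-ins for the real adapter internals. */
extern int  dequeue_request(void **msg);         /* non-blocking; returns 1 if a request was taken */
extern void sap_send_message(void *handle, void *msg);
extern void sap_ping(void *handle);

static void           *rfcHandle;                /* set up when the adapter connects */
static pthread_mutex_t queueLock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  queueCond = PTHREAD_COND_INITIALIZER;  /* signalled by the engine on enqueue */

void adapter_main_loop(void)
{
    for (;;) {
        void *msg = NULL;
        struct timespec deadline;

        pthread_mutex_lock(&queueLock);
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += 10;                   /* 10 second idle timeout */

        while (!dequeue_request(&msg)) {         /* queue is protected by queueLock */
            if (pthread_cond_timedwait(&queueCond, &queueLock, &deadline) == ETIMEDOUT)
                break;                           /* nothing arrived: time to ping */
        }
        pthread_mutex_unlock(&queueLock);

        if (msg != NULL)
            sap_send_message(rfcHandle, msg);    /* real work always wins */
        else
            sap_ping(rfcHandle);                 /* keep the idle connection alive */
    }
}

Because only this one loop ever touches the RFC handle, there is no locking around the handle at all, which is the point of the suggestion above.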