Service Broker conversations: to end or not to end™

Time for experts (waiting for #RemusRusanu and mates^^)
Context
I built a service broker thing, heavily based on SQLTeam's blog example (http://www.sqlteam.com/article/centralized-asynchronous-auditing-with-service-broker), to simulate ETL behavior between two databases on the same SQL Server instance. After hours of wrestling with brokers, conversations, disabled queues, poison messages, activation issues and so on, I finally made it work.
Questions
To make it a bit more production-proof, I still have to deal with the following topics:
CLOSED-status conversations are not cleared from sys.conversation_endpoints even though the message queues are emptied. SQLTeam's example does not issue any END CONVERSATION on either the initiator or the target side, and instead adds a comment like
No need to close the conversation because auditing never ends
in the activated procedure. The internet is full of testimonials saying NOT to adopt fire-and-forget behavior. I added END CONVERSATION at the end of my activated procedure, with the same result: tons of used conversations. I know about the benefit of reusing dialogs (thanks to #RemusRusanu), but I want the system to be as stable as possible, i.e. to avoid running out of available endpoint sockets.
=> What am I REALLY supposed to do: end conversations or not? Does LIFETIME have anything to do with this situation?
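For context, here is a minimal sketch of the kind of activated procedure body I mean (queue and message type names are placeholders based on my snippets below, not my real code); the target ends its half of the dialog when it sees EndDialog or Error, and the initiator likewise calls END CONVERSATION when the EndDialog echo comes back:
DECLARE @dlgId UNIQUEIDENTIFIER, @msgType SYSNAME, @body VARBINARY(MAX);
WHILE (1 = 1)
BEGIN
    BEGIN TRANSACTION;
    WAITFOR (
        RECEIVE TOP (1)
            @dlgId   = conversation_handle,
            @msgType = message_type_name,
            @body    = message_body
        FROM dbo.TargetQueue   -- placeholder queue name
    ), TIMEOUT 5000;
    IF (@@ROWCOUNT = 0)
    BEGIN
        ROLLBACK TRANSACTION;
        BREAK;
    END;
    IF (@msgType = N'//Sync/Message')
    BEGIN
        PRINT 'process the payload here';
    END
    ELSE IF (@msgType IN (N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog',
                          N'http://schemas.microsoft.com/SQL/ServiceBroker/Error'))
    BEGIN
        END CONVERSATION @dlgId;   -- the other side is done, end our half too
    END;
    COMMIT TRANSACTION;
END;
As far as I understand, LIFETIME only bounds how long the dialog may stay open; it does not clean up endpoints by itself, so each side still has to end its half.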
Trapped errors are not persisted into the Errors table, even though TRY/CATCH is used and the INSERT statement is placed after the rollback:
    ;SEND ON CONVERSATION @dlgId
        MESSAGE TYPE [//Sync/Message] (@syncedData)
END TRY
BEGIN CATCH
    IF (XACT_STATE()) = -1
        ROLLBACK;
    INSERT INTO SyncErrors(ErrorProcedure, ErrorLine, ErrorNumber, ErrorMessage, ErrorSeverity, ErrorState, SyncedData)
    SELECT ERROR_PROCEDURE(), ERROR_LINE(), ERROR_NUMBER(), ERROR_MESSAGE(), ERROR_SEVERITY(), ERROR_STATE(), @syncedData
END CATCH
=> Why does this never work when an error is raised?
In some examples, the communication is handled with an exchange of different explicit message types (error and conversation acknowledgement) on top of the custom XML message. As far as I understand, this implies two "crossed" endpoints with matching queue-processing activation procedures (versus one in my current example). There are so many opinions on this topic that I am a bit confused. Minutes ago I found this long article from #RemusRusanu (http://rusanu.com/2007/10/31/error-handling-in-service-broker-procedures), but I haven't had the chance to read it yet.
=> Will this provide the stability and robustness increase I am looking for?
Last but not least: I recently saw the system volume unexpectedly fill up with 70+ GB (of Windows event log, by the way) within 2 days, aligned with Service Broker use, which messed up my virtual machine until I found the root cause (thanks to the TreeSize Professional tool).
=> Is this normal behavior? Is there any way to avoid such verbosity, cap event log growth or configure rotation cycles?
Thanks a lot for bringing me to the dark side of the Force :)

Related

About multiple conversations and/or queues

I'm wondering about the exact definition of a conversation, and the MS docs and tutorials are not quite on point about this.
First... is there a difference between a dialog and a conversation?
Assuming a queue should only contain identical or equivalent messages (i.e. message types handled by an activated procedure in a way similar to a CASE WHEN / switch scenario):
Does each conversation revolve around a unique queue?
If a procedure A sends a message to a queue activating a procedure B which handles the message and then emits an answer, can procedure A wait for the answer or should I use a procedure C? Am I right to assume that I must create two queues operating on the same contract? But how many services? In that scenario, how and where would I use END CONVERSATION?
If a procedure A sends a message to a queue activating a procedure B which handles the message and then emits another message or several messages for one or more other procedures C, are all those queues/services/etc. on the same conversation? The same conversation group? (What would I do after GET CONVERSATION GROUP to ensure my conversations are in the same group?) Does that imply passing the same conversation handle when issuing BEGIN TRANSACTION / BEGIN DIALOG, or using
[ WITH
[ { RELATED_CONVERSATION = related_conversation_handle
| RELATED_CONVERSATION_GROUP = related_conversation_group_id } ]
? And... last but not least, if I'm using multiple messages to parallelize/fork calls to C with different parameters, in which cases would I want to start totally different conversations/conversation groups doing the same thing, or is it always better to have a unique "narration"?
Oh... another thing... is there a best practice for using several messages to kick off some processing and then waiting for every one of them to finish before starting the next step? Is there a way in which each procedure would receive a message, send an answer, and then the procedure activated by the answers could check/count the previous messages in its queue and go on only if they are all there? Would that need to check the conversation id (or conversation group id) to be sure those messages were all emitted by the same group of answers?
I hope that's not too confusing, but the MS tutorials are... well... a bit simplistic.
First, a dialog is the same as a conversation as far as I can tell. Two names for the same thing.
Queues can contain many different message types. It's up to the thing processing the messages (whether that's an internally activated stored procedure or an external application) to discriminate on the type and do the "right thing" with it. A service can have only one queue, but a queue can have many services (though I haven't actually seen that in practice). A service defines what message types it can both accept and produce through the service contract.
As to whether you want a queue processor to respond on the same conversation or start a new one, that is completely up to you. My suggestion would be to respond on the same conversation unless you know that you have a good reason not to. As for how to use the same conversation, you can get the conversation handle when you issue the RECEIVE statement; use that as the conversation handle when you issue the subsequent SEND with your reply.
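Here's a minimal sketch of that reply pattern (the queue and message type names are made up for illustration):
DECLARE @handle UNIQUEIDENTIFIER, @msgType SYSNAME, @rawBody VARBINARY(MAX);
BEGIN TRANSACTION;
WAITFOR (
    RECEIVE TOP (1)
        @handle  = conversation_handle,
        @msgType = message_type_name,
        @rawBody = message_body
    FROM dbo.RequestQueue       -- made-up queue name
), TIMEOUT 1000;
IF (@msgType = N'RequestMessage')
BEGIN
    DECLARE @request XML;
    SET @request = CAST(@rawBody AS XML);   -- the incoming payload; process it here
    DECLARE @reply XML;
    SET @reply = N'<ok/>';                  -- build the real answer
    -- The reply travels back on the same dialog, so the initiator's RECEIVE
    -- (or its own activated procedure) will pick it up.
    SEND ON CONVERSATION @handle
        MESSAGE TYPE [ReplyMessage] (@reply);
END;
COMMIT TRANSACTION;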
The way I think about conversation groups is you may need to talk to different services in regards to the same thing. Here's a contrived example:
Let's say that I have a new hire process. It has the following steps:
Create a login
Create an entry in the payroll system
Register them with your insurance provider
They're all logically for the same event though (i.e. "I hired a new employee"). So, you could bundle all of the conversations in one conversation group and keep track of the individual conversations separately. Something like this:
DECLARE @handle UNIQUEIDENTIFIER, @group UNIQUEIDENTIFIER = NEWID(),
    @message XML = '<employee name="Ben Thul" />';
BEGIN TRAN
BEGIN DIALOG @handle
    FROM SERVICE [EmployeeService]
    TO SERVICE 'LoginService'
    ON CONTRACT [LoginContract]
    WITH RELATED_CONVERSATION_GROUP = @group;
SEND ON CONVERSATION (@handle)
    MESSAGE TYPE [NewLoginRequest]
    (@message);
INSERT INTO [dbo].[OpenRequests]
(
    [GroupIdentifier],
    [ConversationIdentifier],
    [ServiceName],
    [Status],
    [date_modified]
)
VALUES
    (@group, @handle, 'LoginService', 'RequestSent', GETUTCDATE());
BEGIN DIALOG @handle
    FROM SERVICE [EmployeeService]
    TO SERVICE 'PayrollService'
    ON CONTRACT [PayrollContract]
    WITH RELATED_CONVERSATION_GROUP = @group;
SEND ON CONVERSATION (@handle)
    MESSAGE TYPE [NewPayrollRequest]
    (@message);
INSERT INTO [dbo].[OpenRequests]
(
    [GroupIdentifier],
    [ConversationIdentifier],
    [ServiceName],
    [Status],
    [date_modified]
)
VALUES
    (@group, @handle, 'PayrollService', 'RequestSent', GETUTCDATE());
BEGIN DIALOG @handle
    FROM SERVICE [EmployeeService]
    TO SERVICE 'InsuranceService'
    ON CONTRACT [InsuranceContract]
    WITH RELATED_CONVERSATION_GROUP = @group;
SEND ON CONVERSATION (@handle)
    MESSAGE TYPE [NewInsuranceRequest]
    (@message);
INSERT INTO [dbo].[OpenRequests]
(
    [GroupIdentifier],
    [ConversationIdentifier],
    [ServiceName],
    [Status],
    [date_modified]
)
VALUES
    (@group, @handle, 'InsuranceService', 'RequestSent', GETUTCDATE());
COMMIT
Now you have a way to track each of those requests separately and a way to tie them all back to the same logical operation. As each service processes its message, it will respond with either a success, a failure, or an "I need something else" message, at which point you can update the OpenRequests table with the current status.
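A rough sketch of that reply-processing side (again, the message type name and the status values are illustrative, not a full implementation):
DECLARE @handle UNIQUEIDENTIFIER, @msgType SYSNAME;
BEGIN TRANSACTION;
WAITFOR (
    RECEIVE TOP (1)
        @handle  = conversation_handle,
        @msgType = message_type_name
    FROM dbo.EmployeeQueue      -- the queue behind [EmployeeService]
), TIMEOUT 1000;
IF (@msgType = N'LoginCreated')     -- an assumed success reply from LoginService
BEGIN
    UPDATE [dbo].[OpenRequests]
    SET [Status] = 'Completed', [date_modified] = GETUTCDATE()
    WHERE [ConversationIdentifier] = @handle;
    END CONVERSATION @handle;       -- this particular request is finished
END;
COMMIT TRANSACTION;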
Service broker can be overwhelming. My advice for you is to think about what messages need to be passed from where to where and start designing services, message types, contracts, etc around that. It's unlikely that you're going to use all of the functionality that SB has to offer.

Auto-Recover when DBNETLIB ConnectionWrite General network error causes ADO connections to go offline in Delphi applications?

Googling this ADO error message indicates that it is commonly encountered in ASP.NET development, but I have not found much mention of it occurring in Delphi applications. We have some customer sites which are experiencing transient network problems, and this is the symptomatic error message. We can duplicate it easily in office testing: just shut down an MS SQL Server service while your Delphi TADOConnection object is connected to a database on that server instance, and you get this exception:
[DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.
Yes, you can catch this exception, and then you know (or do you?) that this error has occurred. Except that this is an 800 KLOC+ application with over 10,000 try-except blocks around database actions, any one of which might fail with this error.
TADOConnection has some error events, none of which fire in this case. However, the ADO connection itself is faulted once this occurs; even if you restart the SQL database, TADOConnection.Connected remains true, but it's lying to you. It's really in a faulted state.
So then, my question is:
Can you detect this faulted state, and recover from it, in any way that is less work than going into 10,000 individual try-except blocks and setting some global "reconnect ADO" variable?
I am hoping there is a way to go into TADOConnection.ConnectionObject (the underlying raw OLEDB COM ADO object) and detect that this fault condition exists when we are starting a new query, so that we can reset the ADOConnection and continue the next time we run a query, since our code is organized in a way that would let us detect this "after the failure" much more easily than it would let us do it the way I would do it in a 10-line demo application.
This other SO question asks why it happens; that is not what I'm asking, so please don't give me "prevention" answers, I know about them already. I'm looking for a recovery and detection-of-stalled-ADO-connection technique other than catching the exceptions. In fact, this is a good example of exceptions gone wrong; ADO is a Schrödinger's-cat object in this failure mode.
I am aware of the MS Knowledge Base articles and the various solutions floating around the internet. I'm asking about RECOVERING without losing customer data once the error condition (which is often transient in our situations) has cleared. That means we freeze our app, show the exception to the customer, and when the customer clicks Retry or Continue, we attempt to repair and continue. Note that our existing code does a million try-except-log-and-continue blocks, which are going to get in our way, so I'm expecting someone to answer that an Application handler for unhandled exceptions is the best way, but sadly we can't use it. I really hope, however, that it is possible to detect a frozen/faulted/dead ADO connection object.
Here's what I have:
try
  if fQueryEnable and ADOConnection1.Connected then begin
    qQueryTest1.Active := false;
    qQueryTest1.Active := true;
    Inc(FQryCounter);
    Label2.Caption := IntToStr(qQueryTest1.RecordCount) + ' records';
  end;
except
  on E: Exception do begin
    fQueryEnable := false;
    Memo1.Lines.Add(E.ClassName + ' ' + E.Message);
    if (E is EOleException) and (Pos('DBNETLIB', E.Message) > 0) then begin
      ADOConnectionFaulted := true; { Global variable. }
    end;
    raise;
  end;
end;
The problem with the above solution is that I need to copy and paste it about 10,000 places in my application.
Well nobody has answered this question, and I think that some follow-up would be helpful.
Here is what I have learned:
There are no reliable situations in which you can reproduce this General Network Error in a test environment. That is to say, we're dealing with irreproducible results, which is where many developers leap into evil hackery in an attempt to "monkeypatch" their broken systems.
Fixing the underlying fault has always and everywhere been better than trying to fix it in code when the SQL library gives a "General Network Error". No repair has ever been shown to be possible, because it usually means "the network is so unreliable that TCP itself has given up on delivering my data". This happens when:
You have a bad network cable.
You have duplicate IP addresses on a network.
You have dueling DHCP servers each handing out different default gateways.
You have local ethernet segments that have poor connectivity between them.
You have an ethernet switch or hub which is failing.
You are being intermittently blocked by a malfunctioning firewall.
Your customer may have changed something on their network, and may now be unable to use your software. (This last one actually happens more than you might think)
Someone may have configured an SQL Alias using cliconfg or other client side configuration elements that are specific to a single workstation's registry settings, and this local configuration may result in bad behaviour that is difficult to diagnose and may be limited to one or several workstations on a large network.
None of the above can be detected and reported at either the TCP or the SQL level. When SQL finally gives up and returns this "General Network Error", no amount of cajoling from my software is going to make it un-give-up, and even if it did, I would be doing a "try/except/ignore" antipattern. This error is so severe that we should raise it all the way up to the user, log it to disk in an error log, give up (quit the program), and tell the user that the network connection is down.
I have seen this happen due to bad coding too.
If you open a recordset using a connection and then reuse that same connection in a loop for another recordset while the first is not closed, that can cause similar errors.
Another occasion, very rarely on web applications, is that you may receive a similar error while the application pool is recycling.
We have different sites on the same server, and I have noticed that with the same application but different customisations, only one site is causing this issue. That led to the above findings.
This blog helped me to find the issues:
http://offbeatmammal.hubpages.com/hub/Optimising_SQL_Server
The code here detects the disconnect event firing and reconnects using a timer. When reading this code, note that you must drop a TTimer onto the data module shown here and create an OnTimer event with the code shown below.
Please check the following code:
unit uDM;

interface

uses
  SysUtils, Classes, DB, ADODB, Vcl.ExtCtrls;

type
  TDM = class(TDataModule)
    ADOConnection: TADOConnection;
    ConnectionTimer: TTimer;
    procedure ADOConnectionDisconnect(Connection: TADOConnection;
      var EventStatus: TEventStatus);
    procedure ConnectionTimerTimer(Sender: TObject);
  private
    { Private declarations }
  public
    { Public declarations }
  end;

var
  DM: TDM;

implementation

{$R *.dfm}

procedure TDM.ADOConnectionDisconnect(Connection: TADOConnection;
  var EventStatus: TEventStatus);
begin
  // A disconnect with errors (or an unexpected disconnect) starts the
  // reconnect timer instead of trying to reconnect inline.
  if EventStatus in [esErrorsOccured, esUnwantedEvent] then
    ConnectionTimer.Enabled := True;
end;

procedure TDM.ConnectionTimerTimer(Sender: TObject);
begin
  ConnectionTimer.Enabled := False;
  try
    // Drop and re-open the connection; if it still fails, re-arm the timer
    // and try again on the next tick.
    ADOConnection.Connected := False;
    ADOConnection.Connected := True;
  except
    ConnectionTimer.Enabled := True;
  end;
end;

end.

Issues with T-SQL TRY CATCH?

We are currently on SQL Server 2005 at work, and I am migrating an old FoxPro system to a new web application backed by SQL Server. I am using TRY CATCH in T-SQL for transaction processing, and it seems to be working very well. One of the other programmers at work was worried about this, as he said he had heard of issues where the CATCH block did not always catch the error. I have beaten the sproc to death and cannot get it to fail (miss a catch), and the only issue I have found searching around the net is that it will not return the correct error number for error numbers < 5000. Has anyone experienced any other issues with TRY CATCH in T-SQL, especially cases where it misses a catch? Thanks for any input you may wish to provide.
TRY ... CATCH doesn't catch every possible error, but the ones it does not catch are well documented in BOL under "Errors Unaffected by a TRY…CATCH Construct":
TRY…CATCH constructs do not trap the following conditions:
Warnings or informational messages that have a severity of 10 or lower.
Errors that have a severity of 20 or higher that stop the SQL Server Database Engine task processing for the session. If an error occurs that has severity of 20 or higher and the database connection is not disrupted, TRY…CATCH will handle the error.
Attentions, such as client-interrupt requests or broken client connections.
When the session is ended by a system administrator by using the KILL statement.
The following types of errors are not handled by a CATCH block when they occur at the same level of execution as the TRY…CATCH construct:
Compile errors, such as syntax errors, that prevent a batch from running.
Errors that occur during statement-level recompilation, such as object name resolution errors that occur after compilation because of deferred name resolution.
These errors are returned to the level that ran the batch, stored procedure, or trigger.
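That "same level of execution" detail matters in practice: an error that escapes TRY…CATCH when it is raised in the current batch is caught when it is raised one level down, for example inside dynamic SQL. A quick illustrative repro (the table name is made up and assumed not to exist):
BEGIN TRY
    SELECT * FROM dbo.TableThatDoesNotExist;   -- same level of execution
END TRY
BEGIN CATCH
    PRINT 'never reached: the batch is aborted instead';
END CATCH;
GO
BEGIN TRY
    EXEC sys.sp_executesql N'SELECT * FROM dbo.TableThatDoesNotExist;';
END TRY
BEGIN CATCH
    PRINT ERROR_MESSAGE();   -- one level down: the outer CATCH does see it
END CATCH;
GO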
There was one case in my experience when a TRY...CATCH block didn't catch the error. It was an error connected with collation:
Cannot resolve the collation conflict between "Latin1_General_CI_AS" and "Latin1_General_CI_AI" in the equal to operation.
Maybe this error corresponds to one of the error types documented in BOL:
Errors that occur during statement-level recompilation, such as object name resolution errors that occur after compilation because of deferred name resolution.
TRY ... CATCH will also fail to catch the error if you pass a "bad" search term to CONTAINSTABLE.
For example:
DECLARE @WordList VARCHAR(800)
SET @WordList = 'crap"s'
SELECT * FROM CONTAINSTABLE(table, *, @WordList)
The CONTAINSTABLE call will give you a "syntax error", and any surrounding TRY ... CATCH does not catch it.
This is particularly nasty because the error is caused by data, not by a "real" syntax error in your code.
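One possible mitigation (my suggestion, not something from the documentation) is to sanitize the raw term before handing it to CONTAINSTABLE, for example by stripping embedded double quotes and wrapping the result in quotes so it is treated as a literal phrase, assuming phrase semantics are acceptable for your search. A rough sketch (the table name is a placeholder):
DECLARE @RawTerm  VARCHAR(800)
DECLARE @WordList VARCHAR(810)
SET @RawTerm = 'crap"s'
-- Strip embedded double quotes and quote the whole term as a phrase.
SET @WordList = '"' + REPLACE(@RawTerm, '"', '') + '"'
SELECT * FROM CONTAINSTABLE(dbo.MyFullTextTable, *, @WordList)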
I'm working in SQL Server 2008. I built a big SQL statement that had a try/catch. I tested it by renaming a table (in dev). The statement blew up and didn't catch the error. Try/catch in SQL Server is weak, but better than nothing. Here's a piece of my code; I can't put any more in because of my company's restrictions.
    COMMIT TRAN T1;
END TRY
BEGIN CATCH
    -- Save the error.
    SET @ErrorNumber = ERROR_NUMBER();
    SET @ErrorMessage = ERROR_MESSAGE();
    SET @ErrorLine = ERROR_LINE();
    -- Put GSR.dbo.BlahBlahTable back the way it was.
    ROLLBACK TRAN T1;
END CATCH
-- Output a possible error message. Knowing what line the error happened at really helps with debugging.
SELECT @ErrorNumber AS ErrorNumber, @ErrorMessage AS ErrorMessage, @ErrorLine AS LineNumber;
I have never hit a situation where TRY...CATCH failed. Neither, probably, have many of the people who read this question. This, alas, only means that if there is such a SQL bug, we haven't seen it. The thing is, that's a pretty big "if". Believe it or not, Microsoft does put some effort into making their core software products pretty solid, and TRY...CATCH is hardly a new concept. A quick example: in SQL 2005, I encountered a solid, demonstrable, and replicable bug while working out the then-new table partitioning, a bug that had already been fixed by a patch. And TRY...CATCH gets used a bit more frequently than table partitioning.
I'd say the burden of proof falls on your co-worker. If he "heard it somewhere", then he should try to back it up with some kind of evidence. The internet is full of proof for the old saying "just because everyone says it's so doesn't mean they're right".

Queue stops (disables) without any poison message

I have a queue that stops (gets disabled) for no apparent reason. For this queue I have implemented poison message handling, and during processing it records and discards any poison messages.
It has worked fine for more than a year without stopping, but recently (the problem began four weeks ago) it stops once or twice a week, and this week alone it stopped twice.
And when I check the table with the newly poisoned messages, there are none!! And when I enable the queue, processing resumes successfully and the 'poison message' situation does not reproduce.
About the task of the queue: it receives about 2-3000 messages per day. It is used to run stored procedures outside the transaction, and each message can take a while to be processed (doing a lot of selects, inserts, updates).
Let me explain this point: the database has triggers that are fired inside a transaction, and the trigger sends a message to run some code outside the trigger. The asynchronous behavior keeps the database's performance from degrading.
I have detected that even when a deadlock occurs while processing the messages, the queue treats the message as poisoned. So in principle it shouldn't be a performance problem. But could it be? Maybe the database is growing and it takes too long to process a message?
But how can I find that out if the message is not detected as poisoned?
For what other reasons does a queue stop?
How can I record when, and with which message, the queue got disabled?
Does anybody have any idea how I can do any forensic analysis?
Any idea?
UPDATE, EXPOSING A PSEUDO-SOLUTION:
Following Remus' post, I've tried to use an event notification to get the exact moment when the queue stops.
CREATE EVENT NOTIFICATION [QueueDisabledEN]
ON QUEUE [dbo].[ProcessQueue]
FOR BROKER_QUEUE_DISABLED
TO SERVICE 'Queue Watch Service', 'current database';
And then checking the event log:
select * from sys.event_notifications
But since it is difficult to know the environment in which the event occurred (what else was running at that moment?), forensic analysis ends there. Fortunately my broker service implementation stores the messages with the date of shipment, the date of receipt, the date of processing, ... This has helped me detect that within 3 seconds the queue gets flooded with hundreds of messages that take too long to be processed.
While I look for a real solution, the only temporary solution is an Agent job that checks the status of the queue every x minutes and enables it:
IF (EXISTS(SELECT * FROM sys.service_queues WHERE name LIKE 'ProcessQueue' AND (is_receive_enabled = 0 OR is_enqueue_enabled = 0)))
BEGIN
    PRINT CONVERT(nvarchar, GETDATE(), 121) + ': Enabling the ProcessQueue queue'
    ALTER QUEUE ProcessQueue WITH STATUS = ON
END
Thanks Remus!
When you find the queue in a disabled state and you enable it again, I assume that processing resumes successfully and the 'poison message' situation does not reproduce. That would indicate that the cause is transient or time related. It could be a SQL Agent job that runs and causes deadlocks with the queue processing, forcing the queue processing to roll back. Deadlocks are, in my experience, the most typical poison message cause. Your best forensics tool is the system event log, as the activated procedure does output errors into the ERRORLOG and hence into the system event log.
Whenever a queue is disabled by the poison message trigger (5 consecutive rollbacks), an event notification of type BROKER_QUEUE_DISABLED is fired. You can capture more forensic information when handling this event, as it runs shortly after the moment the queue was disabled.
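A minimal sketch of capturing that notification (the target queue and log table names are assumed here, following the 'Queue Watch Service' idea from the update above):
DECLARE @handle UNIQUEIDENTIFIER, @msgType SYSNAME, @rawBody VARBINARY(MAX);
BEGIN TRANSACTION;
WAITFOR (
    RECEIVE TOP (1)
        @handle  = conversation_handle,
        @msgType = message_type_name,
        @rawBody = message_body
    FROM dbo.QueueWatchQueue    -- assumed queue behind 'Queue Watch Service'
), TIMEOUT 1000;
IF (@msgType = N'http://schemas.microsoft.com/SQL/Notifications/EventNotification')
BEGIN
    DECLARE @event XML;
    SET @event = CAST(@rawBody AS XML);
    -- Persist when it happened plus the full event, so the state of the system
    -- at that moment can be investigated later.
    INSERT INTO dbo.QueueDisabledLog (EventTime, FullEvent)   -- assumed table
    VALUES (@event.value('(/EVENT_INSTANCE/PostTime)[1]', 'DATETIME'), @event);
END;
COMMIT TRANSACTION;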
As a side note, you can never have true 'poison message handling'. Whenever you enhance the processing to handle some error cases, the definition of the 'poison message' changes to be the message capable of disabling the new error handling.

Can I force certain settings by user in SQL Server?

I have a bunch of users making ad-hoc queries into a database running SQL Server. Occasionally someone will run a query that locks up the system for a long period of time as they retrieve 10MM rows.
Is it possible to set some options when a specific login connects? e.g.:
transaction isolation level
max rowcount
query timeout
If this isn't possible in SQL Server 2000, is it possible in another version? As far as I can tell, the resource governor does not give you control like this (it just lets you manage memory and CPU).
I realize the users could do much of this themselves, but it'd be awesome if I could control it from the server per user.
Obviously I'd like to move the users away from direct table/view access but that's not an option at the moment.
You can certainly limit the query results. I'm doing it for StackQL using a stored procedure that looks something like this:
CREATE PROCEDURE [dbo].[WebQuery]
    @QueryText nvarchar(1000)
AS
BEGIN
    INSERT INTO QueryLogs(QueryText)
    VALUES(@QueryText)

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
    SET QUERY_GOVERNOR_COST_LIMIT 15000
    SET ROWCOUNT 500

    BEGIN TRY
        exec (@QueryText)
    END TRY
    BEGIN CATCH
        SELECT ERROR_NUMBER() AS ErrorNumber,
               ERROR_MESSAGE() AS ErrorMessage,
               ERROR_PROCEDURE() AS ErrorProcedure,
               ERROR_SEVERITY() AS ErrorSeverity,
               ERROR_LINE() AS ErrorLine,
               ERROR_STATE() AS ErrorState
    END CATCH
END
The important part here is the series of three SET statements after the logging. This limits the query based on both the number of rows in the result and the expected cost of the query. The rowcount and query governor values can use variables, so it shouldn't be hard to modify this to change the restriction based on the current user as well.
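As a rough sketch of that per-user idea (the UserQueryLimits lookup table is hypothetical, something you would add yourself; this would replace the fixed SET ROWCOUNT 500 inside the procedure):
DECLARE @MaxRows INT;
-- Assumed table: UserQueryLimits(LoginName sysname, MaxRows int).
SELECT @MaxRows = MaxRows
FROM dbo.UserQueryLimits
WHERE LoginName = SUSER_SNAME();
IF (@MaxRows IS NULL)
    SET @MaxRows = 500;     -- default cap for everyone else
SET ROWCOUNT @MaxRows;      -- SET ROWCOUNT accepts a variable
EXEC (@QueryText);
SET ROWCOUNT 0;             -- remove the cap afterwards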
However, you should also note that it's pretty easy for users who are "in the know" to bust out of that if they want. In my case I consider the ability to get past the limits from time to time a feature. But it's also why I do the logging: the code to get past the limits sticks out in the logs, and so I can easily catch and ban anyone doing it too often without my permission.
Finally, any user that calls this should be in only the db_denydatawriter and db_datareader roles, and then be given explicit permission to execute just this stored procedure. Then they can't really do anything but select from existing tables.
Now, I'll anticipate that your next question is whether you can make this automatic from somewhere like Report Builder or Management Studio. Unfortunately, I don't think that's possible. You'll need to give them some kind of interface that makes it easy to call your stored procedure. But I could be wrong here.
In SQL Server 2005 and 2008 you can add a logon trigger that could change connection settings based on who is connecting. However, you need to be very careful with these, as a mistake or error in the trigger could result in everyone being locked out.
There's nothing like that in SQL Server 2000.
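For illustration only, here is a heavily hedged sketch of a logon trigger (the login name and the time window are made up). Rather than changing session options, it simply denies the connection outside a window, which is the more common and easier-to-reason-about use of logon triggers; remember that a buggy logon trigger can lock everyone out:
CREATE TRIGGER trg_limit_adhoc_logins
ON ALL SERVER
FOR LOGON
AS
BEGIN
    -- 'ReportingUser' is a made-up login; block it during business hours.
    IF ORIGINAL_LOGIN() = N'ReportingUser'
       AND DATEPART(HOUR, GETDATE()) BETWEEN 8 AND 18
    BEGIN
        ROLLBACK;   -- rolling back inside a logon trigger cancels the connection
    END;
END;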
I just wanted to add that if Joel's approach works for you, then I would strongly encourage that method over mine, because a logon trigger is one mighty big and dangerous grenade to be throwing at this problem.
If you still really want to use them, here is a nice article that demonstrates how; even more important, however, is this one, which has the crucial instructions for what to do if your logon trigger locks everyone out.
