Diagnosing poor SQL Server Service Broker forwarding performance - sql-server

I've been running Service Broker in my development environment for a few months now and have had perfectly adequate performance, approaching 1,000 messages per second (plenty for my needs).
I've also been running on a cut-down replica of my real production environment, which involves a forwarding instance, and for the first time today I pushed some load through it with terrible results! I'm trying to understand what I've been seeing, but am struggling a bit, so I thought I'd put it out there to see if anyone can help.
Firstly, messages are being delivered from start to end through the forwarder. However, when I pushed a few thousand messages through, I saw batches of between 20 and 100 being sent, followed by delays of a minute or two. The messages are ultimately processed successfully.
Looking at the queue on the Store (the initial sender), there are thousands of messages sitting there waiting to be forwarded, and they are only trickling out.
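The backlog (and the last transport status recorded for each pending message) can be inspected on the Store with something like this sketch:

    -- Pending outbound Service Broker messages on the sending side.
    SELECT TOP (50)
           conversation_handle,
           to_service_name,
           enqueue_time,
           transmission_status   -- empty string means no transport error recorded yet
    FROM   sys.transmission_queue
    ORDER  BY enqueue_time;

    -- Rough size of the backlog.
    SELECT COUNT(*) AS pending_messages
    FROM   sys.transmission_queue;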
The security setup looks like this:
Store database -> Certificate -> Forwarding instance -> Windows Security -> Central database
When I switch on Profiler I see lots of errors.
Some examples on the forwarding instance:
7 - Send IO Error (10054(failed to retrieve text for this error. Reason: 15105))
Forwarded Message Dropped (The forwarded message has been dropped because a transport send error occurred when sending the message. Check previous events for the error.)
And on my 'central' target instance:
A corrupted message has been received. The binary message preamble is malformed.
Broker message undeliverable This message was dropped because it could not be dispatched on time. State: 2
Can anybody help by pointing me towards some checks I could make, or maybe something obvious that I've missed? I know I've got something wrong but just can't see what.
Edit - 14/1/2011 - more information:
Some more information on this: we took our message forwarding instance out of the equation and saw massive improvements immediately; 2000 messages were delivered in seconds.
The architecture uses transport security, so we're currently trying to switch over to dialog security, as we've read that transport security combined with forwarding can harm performance. We're hoping dialog security will reduce how much the forwarding instance needs to decrypt and therefore improve performance.
First thing Monday I want to switch off encryption on the transport layer (between initiator and forwarder) to see if that is where our bottleneck is occurring. Is it possible that this could cause a big overhead in our communications, or should a single forwarding instance not produce such a big bottleneck?
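Switching it off should be something like this (a sketch; our endpoint name may differ, so I'll take it from sys.service_broker_endpoints, and both ends of the hop need compatible settings):

    -- On each end of the initiator-to-forwarder hop being tested.
    ALTER ENDPOINT BrokerEndpoint
    FOR SERVICE_BROKER (ENCRYPTION = DISABLED);

    -- Confirm the current setting.
    SELECT name, encryption_algorithm_desc
    FROM   sys.service_broker_endpoints;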

What SQL Server version?
Several issues with forwarding performance have been fixed. I recommend you upgrade to the latest SQL Server 2008 R2 and deploy the latest cumulative updates. If an upgrade is problematic in your environment, you can upgrade only the forwarder instance.
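A quick way to check which build each instance is on (worth running on the forwarder as well as on both endpoints):

    SELECT SERVERPROPERTY('ProductVersion') AS build,
           SERVERPROPERTY('ProductLevel')   AS service_pack,
           SERVERPROPERTY('Edition')        AS edition;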

This might be a stupid suggestion, but have you changed the network topology lately? Maybe swapped out a network cable or overheated a switch? If this started suddenly, it sounds more like a physical change than a logical one. I'd check the Windows event log on both machines.

Yes, dialog security is the best approach in conjunction with forwarders. Otherwise the overhead will be enormous.
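With dialog security the message body stays encrypted end to end, so the forwarder passes it along instead of decrypting and re-encrypting it on each hop. On the initiating database it is enabled by pointing a remote service binding at the user who owns the target service's certificate, along these lines (a minimal sketch; the names are placeholders and the certificate exchange is assumed to already be in place):

    -- In the Store (initiator) database.
    CREATE REMOTE SERVICE BINDING CentralServiceBinding
        TO SERVICE 'CentralTargetService'
        WITH USER = CentralServiceUser, ANONYMOUS = OFF;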

Related

InterBase 2020 crashes/loops

We use InterBase 2020 as our production DB with UTF8 (approx. 250 simultaneous users). With this database we have two main problems that we have not been able to solve.
In the past we had a problem with an older UDF that crashed our database because it was not ready for Unicode string operations. As a result we changed to Unicode-compatible versions.
For the last few years we sometimes get a 'hiccup' (as we call it). In this case every client loses its connection and the guardian restarts the server. The clients can connect again without us doing anything.
The second problem is that sometimes InterBase does not crash, but everyone loses their connection and it is not possible to reconnect (from a client, or from IBExpert for example). In this case we have to restart the whole server.
These problems occur irregularly. Most times it starts with a hiccup. After a while (maybe two to ten hours later), the second problem arrives and we need to restart our database. If we are lucky we only need to restart the server 2-3 times; on a bad day we need to restart it more often, as the second problem returns again and again (for example every 30 minutes).
We have not yet been able to pin down this problem. It doesn't matter whether users are connected to the database or it is just idling over the weekend; it also often happens when nobody is connected.
Even the server logs haven't given us any hints that have helped yet.
- We minimized UDF use as much as possible and changed to newer UDFs that support Unicode, etc.
- Functions that (AFAIK) can crash the server are guarded so that they don't receive, for example, invalid datetimes.
- We regularly update the database server to the newest version.
- We also updated the client DLLs.
- We also updated the connection components (IBDAC) and Delphi 11.1.
- We wrote an exception tracker into our client software (unfortunately it only ever records the connection-lost error).
- We regularly check active transactions for anything hanging, looping, or stuck on snapshot creation.
Do you have any information that we could use to solve our problems? Is there any possibility of getting more information out of the log files (are other log levels possible)? We don't want to log every procedure call if it isn't necessary, but if there are no other options we will have to.
Thanks for your help!
Matze,
I suggest you log a case with our Support team at Embarcadero (https://www.embarcadero.com/support). They will work with you to understand the specifics of the crash, get relevant details (and Performance Monitoring information) from you, and help us work on a resolution (if it is not already addressed in our latest update).
We have addressed a few corner cases (and other crash reports) in many updates over the past couple of years in InterBase 2020, and are eager to get to the bottom of this issue as well. You can see some of the resolved crash reports at https://docwiki.embarcadero.com/InterBase/2020/en/Resolved_Defects
Supporting 250 simultaneous users is not the problem, but understanding how your use cases run into any potential system resource limits is important.
You do mention that you have the latest updates to InterBase 2020, but I do not see a build number in your message. You can get the most recent update build (14.4.0.804) of the server (if on Windows) from https://my.embarcadero.com/#downloadDetail/1383
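As a starting point on the monitoring side, the temporary system tables can be sampled the moment a hiccup begins, to see what is attached and which transactions and statements are in flight (a sketch; the exact columns available depend on your edition and build):

    SELECT * FROM TMP$ATTACHMENTS;   /* who is attached right now */
    SELECT * FROM TMP$TRANSACTIONS;  /* transactions in flight; long-running ones are a common culprit */
    SELECT * FROM TMP$STATEMENTS;    /* statements currently executing */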

How can I check if the system time of the DB server is correct?

I got a bug case from the service desk which turned out to be the result of different system times on the application server (JBoss) and the DB (Oracle) server. As a result, timeouts fired at the wrong moments.
It doesn't happen often, but for the future it would be better if the app server could raise an alarm about bad time on the DB server before it results in deeper problems.
Of course, I can simply run
SELECT CURRENT_TIMESTAMP FROM dual
and compare the result against the local time. But sending the query and receiving its result will probably take a noticeable amount of time, and I may classify a good time as bad or vice versa.
I can also measure the time from sending the query to the return of the result. But that only works correctly on a good network without lag, and if the time on the DB server is wrong, it is quite likely the network around the DB server is not OK either. Queues on the DB server can make the send and receive legs noticeably unequal.
What is the best way you know to check the time on the DB server?
Constraints: precision of 5 seconds
false alarms < 10%
To be optimized (minimized): missed alarms.
Maybe I am reinventing the wheel and JBoss and/or Oracle already have a tool for this? (I could not find one.)
Have a program on the app server record the local time, then query the database time (CURRENT_TIMESTAMP), and record the local time again after the query returns.
Confirm that the DB time falls between the two app-server times (with whatever tolerance you need). You can include a separate check on how long the DB took to respond, but it should be trivial.
If the environment is some form of VM, issues are most likely to arise when the VM is started or resumed from a pause. There might also be situations where a clock is running fast or slow, so recording the times would let you look for trends in either direction and take preemptive action.

What happens when bandwidth is exceeded with a Microsoft SQL Server connection?

The application used by a group of 100+ users was made with VB6 and RDO. A replacement is coming, but the old one is still maintained. Users moved to a different building across the street and problems began. My opinion has been that the problem is bandwidth, but I've had to argue with others who say it's the database. Users regularly experience network slowness using the application, and with workstation tasks in general. The application occasionally moves and indexes large audio files, among other tasks. Occasionally the database becomes hung. We have many top-end, robust SQL Servers, so it is not a server problem.
What I figured out is that a transaction begins on a connection but fails to complete properly because of a communication error. Updates from other connections become blocked, they keep stacking up, and users are down for half a day. What I've begun doing the moment I'm told of a problem, after verifying the database is hung, is to set the database to single-user mode and then back to multi-user to clear the connections. Everyone must then restart their applications.
Today I found out there is a bandwidth limit at their new location which they regularly max out. I think in the old location there was a big pipe servicing many people, but now they are on a small pipe servicing a small number of people, which is also less tolerant of momentary spikes in bandwidth demand.
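The clear-out amounts to something like this (MyDatabase being a placeholder for the affected database):

    -- Kick everyone off and roll back in-flight transactions, then reopen the database.
    ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    ALTER DATABASE MyDatabase SET MULTI_USER;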
What I want to know is exactly what happens to packets, both inbound and outbound, when the bandwidth limit is reached, and what happens to SQL Server communication. Do some packets get dropped? Do they start arriving out of sequence? Do timing problems occur?
I plan to start controlling things like file moves through the application, but I also want to know what configuration network nodes usually have for handling transient spikes in demand.
This is a very broad question. Networking is key to good performance (especially with Availability Groups or any sort of mirroring setup). When a transaction completes on the SQL Server, the results are placed in the output buffer. The app then needs to 'pick up' that data, clear the output buffer, and continue on. I suspect (without knowing your configuration) that your apps aren't able to complete the round trip because the network pipe is inundated with requests, so the apps can't get what they need to finish and close out. This causes havoc because the network can't keep up with what the apps and SQL Server are trying to do, and then you have a 200-car pileup on a one-lane highway.
Hindsight being what it is, there should have been extensive testing of network capacity before everyone moved across the street. Clearly that didn't happen, so you are left to do what you can with what you have. If the company can't provide a stable network connection, the situation may be out of your control. If you're the DBA, I highly recommend you speak to your higher-ups and explain the consequences of the reduced network capacity. Often, showing the consequences of inaction can lead to action.
Out of curiosity, is there any way you can analyze what waits are happening when the pileup occurs? I'm thinking it will be something along the lines of ASYNC_NETWORK_IO, which usually indicates that SQL Server is waiting on the app to come back and pick up its data.
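Something along these lines would show the waits live during a pileup (a sketch using the standard DMVs):

    -- Current requests and what they are waiting on.
    SELECT r.session_id,
           r.status,
           r.wait_type,
           r.wait_time,
           r.blocking_session_id,
           r.command
    FROM   sys.dm_exec_requests AS r
    WHERE  r.session_id > 50        -- skip most system sessions
    ORDER  BY r.wait_time DESC;

    -- Cumulative waits since the last restart (look for ASYNC_NETWORK_IO near the top).
    SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
    FROM   sys.dm_os_wait_stats
    ORDER  BY wait_time_ms DESC;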

Akka Actors: Handling DB Failures Without Losing Data

Scenario
The DB for an application has gone down. This results in any actor responsible for committing important data to the DB failing to get a connection.
Preferred Behaviour
The important data is written to the DB when it comes back up at some point in the future.
Current Implementation
The actor catches the DBException, wraps the data in a DBWriteFailed case class, and sends the message to its supervisor. The supervisor then schedules another write for some point in the future (e.g. one minute later) using system.scheduler.scheduleOnce(...), so that we don't spin in circles too much while waiting for the DB to come back up.
This implementation certainly works but I feel there might be a better way.
The protocol gets a bit messier when the committing actor has to respond to the original sender after a successful commit.
The regular flow of messages to the committing actor is not throttled in any way, and the actor will happily process the new messages, likely failing to connect to the DB for each and every one of them.
If messages get caught in this retry loop for too long, the mailboxes of the committing actors will start to balloon. It is important that this data be committed, but none of it matters if the application grinds to a halt or crashes from excessive memory usage.
I am an akka novice and I am largely inexperienced when it comes to supervisor strategies, but I feel as though I may be able to leverage one of those to handle some of this retry logic.
Is there a common approach in akka for solving a problem like this? Am I on the right track or should I be heading in a different direction?
Any help is appreciated.
You can use the Akka Circuit Breaker to reduce connection attempts. Instead of using the scheduler as a retry queue, I would keep a buffer (with a maximum size limit) inside the actor and retry the buffered writes when the circuit breaker closes again (the onClose callback should send a message to the actor itself). An alternative could be to combine the circuit breaker with a stashing mailbox.
If you're planning to implement full failover in your app
Don't.
Do not bubble database failover responsibility up into the app layer. As far as your app is concerned, the database should just be up and ready to accept reads and writes.
If your database goes down often, spend time making your database more robust (there's a multitude of resources on the web already for this: search the web for terms like 'replication', 'high availability', 'load-balancing' and 'clustering', and learn from the war stories of others at highscalability.com). It all really depends on what the cause of your DB outages are (e.g. I once maxed out the NIC on the DB master, and "fixed" the problem intermittently by enabling GZIP on the wire).
You'll be glad you adhered to a separation of concerns if you go down this route.
If you're planning to implement the odd sprinkling of retry logic and handling DB brown-outs
If you're not expecting your app to become a replacement database, then Patrik's answer is the best way to go.

Is it a good idea to use Database Mail as an email relay server?

One of our problems is that our outbound email server sucks sometimes. Users will trigger an email in our application, and the application can take on the order of 30 seconds to actually send it. Let's make it even worse and admit that we're not even doing this on a background thread, so the user is completely blocked during this time.
SQL Server Database Mail has been proposed as a solution to this problem, since it basically implements a message queue and is physically closer and far more responsive than our third-party email host. It's also admittedly really easy for us to implement, since it's just replacing one call to SmtpClient.Send with the execution of a stored procedure. Most of our application email contains PDFs, XLSs, and so forth, and I've seen the size of these attachments reach as high as 20 MB.
Using Database Mail to handle all of our application email smells bad to me, but I'm having a hard time talking anyone out of it given the extremely low cost of implementation. Our production database server is also heavily over-provisioned, so I can't even argue that it couldn't handle the load. Any ideas or safer alternatives?
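For context, the replacement call would look something like this (a sketch; the profile name, recipient, and path are placeholders):

    -- Queues the message in msdb and returns immediately; Database Mail sends it asynchronously.
    EXEC msdb.dbo.sp_send_dbmail
         @profile_name     = 'AppMailProfile',
         @recipients       = 'user@example.com',
         @subject          = 'Your report',
         @body             = 'The requested report is attached.',
         @file_attachments = 'C:\Reports\report.pdf';

Note that Database Mail enforces a maximum attachment size (the MaxFileSize setting changed via msdb.dbo.sysmail_configure_sp, 1 MB by default), which would need raising for 20 MB attachments.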
All you have to do is run it through an SMTP server, and if you're planning on sending large amounts of mail then you'll have to not only load-balance the servers (and the DNS servers, if you're planning on sending out 100K+ mails at a time) but also make sure your outbound email servers have the proper A records registered in DNS to prevent bounce-backs.
It's a cheap solution (minus the load balancer costs).
Yes, dual-home the server on your internal LAN and the internet and make sure it's an outbound-only server. Start out with one SMTP server, and if you hit bottlenecks right off the bat, look at whether they are memory, disk, network, or load related. If it's load related then it may be time to look at load balancing. If it's memory related, throw more memory at it. If it's disk related, throw a RAID 0+1 array at it. If it's network related, use a bigger pipe.
