I would like to share a problem related to page latches in SQL Server; I don't know what the root cause is or how to deal with it.
In my experience, page latch wait time is normally high when many processes are accessing the same page; more specifically, when that happens we might see many sessions captured by sp_whoisactive.
The situation I am having is different: sp_whoisactive captured only one process, which has been waiting for 30 seconds on PAGELATCH_EX on page xyz. It is not blocked by any session, and no other processes appear to be involved in this latch. It appears that only one session is waiting on this page, so I don't know which session or process holds the incompatible latch on page xyz.
Which information/metrics should I capture to trace this issue? Do you have any recommendations?
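So far the only thing I know to capture while the wait is active is a snapshot like this; a minimal sketch using the standard waiting-tasks DMV (for PAGELATCH waits, resource_description contains the page as dbid:fileid:pageid):

-- Who is waiting on a page latch right now, on which page, and who (if anyone)
-- is reported as blocking them.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.resource_description,   -- dbid:fileid:pageid for PAGELATCH_* waits
       wt.blocking_session_id,
       r.command,
       r.status
FROM sys.dm_os_waiting_tasks AS wt
LEFT JOIN sys.dm_exec_requests AS r ON r.session_id = wt.session_id
WHERE wt.wait_type LIKE 'PAGELATCH%';

-- On SQL Server 2019+, map the page to its object (the file_id and page_id
-- below are placeholders; substitute the values from resource_description).
SELECT OBJECT_NAME(pi.object_id) AS object_name, pi.page_type_desc
FROM sys.dm_db_page_info(DB_ID(), 1, 12345, 'LIMITED') AS pi;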
We have the following setup in our project: two applications communicate via a GCP Pub/Sub message queue. The first application produces messages that trigger executions (jobs) in the second (i.e. the first is the controller, and the second is the worker). However, the execution time of these jobs can vary drastically: one could take up to 6 hours, while another could finish in less than a minute. Currently, the worker picks up the messages, starts a job for each one, and acknowledges the messages after their jobs are done (which could be after several hours).
Now getting to the problem: The worker application runs on multiple instances, but sometimes we see very uneven message distribution across the different instances. Consider the following graph, for example:
It shows the number of messages processed by each worker instance at any given time. You can see that some are hitting the maximum of 15 (configured via the spring.cloud.gcp.pubsub.subscriber.executor-threads property) while others are idling at 1 or 2. At this point, we also start seeing messages without any started jobs (awaiting execution). We assume that these were pulled by the GCP Pub/Sub client in the busy instances but cannot yet be processed due to a lack of executor threads. The threads are busy because they're processing heavier and more time-consuming jobs.
Finally, the question: Is there any way to do backpressure (i.e. tell GCP Pub/Sub that an instance is busy and have it re-distribute the messages to a different one)? I looked into this article, but as far as I understood, the setMaxOutstandingElementCount method wouldn't help us because it would control how many messages the instance stores in its memory. They would, however, still be "assigned" to this instance/subscriber and would probably not get re-distributed to a different one. Is that correct, or did I misunderstand?
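For reference, these are the client-side flow-control settings we looked at (property names as I understand them from the Spring Cloud GCP documentation; whether they trigger real re-distribution rather than just limiting local buffering is exactly what I am unsure about):

# Cap how many messages one subscriber instance may hold unacknowledged.
# This limits local buffering; as far as I can tell it does not force
# Pub/Sub to re-assign already-leased messages to another instance.
spring.cloud.gcp.pubsub.subscriber.flow-control.max-outstanding-element-count=15
spring.cloud.gcp.pubsub.subscriber.flow-control.max-outstanding-request-bytes=10485760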
We want to utilize the worker instances optimally and have messages processed as quickly as possible. In theory, we could try to split up the more expensive jobs into several smaller messages, thus minimizing the processing-time differences, but is this the only option?
I am currently investigating a recurring error on the secondary replica of our two-node Always On High Availability cluster. The replica is set up as read-intent only because we use a separate backup solution (Dell Networker).
Tempdb keeps growing on the secondary replica because the version store never gets cleared.
I can fix it temporarily by failing over the availability groups, but after a couple of hours the error appears again on the replica node. The error seems to follow one specific availability group: every node it is currently replicating on gets the error after some time. So I guess it has to be an issue caused by a transaction and not by the system itself.
I tried all the suggestions I could find on Google, but even if I recklessly kill all sessions with a last_batch in the timeframe I get from the Perfmon "longest running transaction time" indicator (as advised here: https://www.sqlservercentral.com/articles/tempdb-growth-due-to-version-store-on-alwayson-secondary-server), the version store does not start cleaning up.
The elapsed seconds shown there also match the output of this query on the secondary node:
select * from sys.dm_tran_active_snapshot_database_transactions
The details are sadly not useful:
(screenshot of the query output)
It shows transaction ID 0 and session ID 823, but that session is long gone and its ID keeps getting reused by other processes already. So I am stuck here.
I tried to match the transaction_sequence_num with anything, but no luck so far.
The primary node shows no open transactions of any kind.
Any help finding the cause of this open snapshot transaction is appreciated.
I have already followed these guides to find the issue:
https://sqlundercover.com/2018/03/21/tempdb-filling-up-on-secondary-replicas/
https://sqlgeekspro.com/tempdb-growth-due-to-version-store-in-alwayson/
https://learn.microsoft.com/en-us/archive/blogs/docast/rapid-growth-of-tempdb-on-alwayson-secondary-replica-due-to-version-store
https://www.sqlshack.com/how-to-detect-and-prevent-unexpected-growth-of-the-tempdb-database/
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/635a2fc6-550b-4e08-b232-0652bd6ea17d/version-store-space-not-being-released?forum=sqldatabaseengine
https://www.sqlservercentral.com/articles/tempdb-growth-due-to-version-store-on-alwayson-secondary-server
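For completeness, these are the checks I keep running on the secondary (a small sketch; sys.dm_tran_version_store_space_usage requires a reasonably recent SQL Server version):

-- Current version store size per database.
SELECT DB_NAME(database_id) AS database_name,
       reserved_space_kb / 1024 AS reserved_space_mb
FROM sys.dm_tran_version_store_space_usage
ORDER BY reserved_space_kb DESC;

-- The same figure the Perfmon counter reports.
SELECT cntr_value AS version_store_kb
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE 'Version Store Size%';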
Update:
To back up my claim that the session is long gone:
In the pictures below you can see the output of sys.dm_tran_active_snapshot_database_transactions, sys.sysprocesses, and sys.dm_exec_sessions:
The first picture shows two currently "open" snapshot database transactions (in the past it was always one, but maybe the more the better), along with the sessions running under those IDs at that moment.
Then I proceeded to kill sessions 899 and 823 and checked again:
Here you can see that sys.dm_tran_active_snapshot_database_transactions still shows the two session IDs, while sys.sysprocesses and sys.dm_exec_sessions show that those two IDs are now in use by a different program, user, database, etc., because I killed them and the ID numbers immediately got reused. If I check during the day, sometimes they are not in use at all.
Going by the elapsed time and the Perfmon "longest running transaction time" counter, I would be looking for a session with a login time or last batch at around 2023-02-03 00:00:56. But even if I check all sleeping sessions, or sessions with a last batch in this range, and kill all of them (as described in all the links above), the "transaction" still shows up in sys.dm_tran_active_snapshot_database_transactions with ever-growing numbers.
Update 2:
In the meantime we had to resolve the issue with a failover because tempdb ran out of space. The new "stuck" session shown in sys.dm_tran_active_transactions has session ID 47 and is currently at around 30,000 seconds and rising, so the problem started at around 2023-02-11 00:00:20.
Here is the output of sys.dm_tran_active_transactions:
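To check whether that stuck entry is still attached to any session at all, I compare it against the session-transaction mapping (a simple sketch; a NULL session_id means no live session owns the transaction):

-- Is any session still attached to the long-running transaction?
SELECT at.transaction_id,
       at.name,
       at.transaction_begin_time,
       st.session_id
FROM sys.dm_tran_active_transactions AS at
LEFT JOIN sys.dm_tran_session_transactions AS st
       ON st.transaction_id = at.transaction_id
ORDER BY at.transaction_begin_time;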
There are many different assumptions going on here which need to be addressed.
First, availability groups work by shipping log blocks; they don't send individual transactions. Log blocks are written to the log on the secondary, and eventually the information inside them is redone.
Second, the version store is only used on readable secondary replicas, so looking on the primary for sessions that are using items on the secondary is not going to help. The version store can only be cleaned up by removing the oldest unused versions until it hits a version that is still in use; it cannot skip versions. Thus, if version 3 is needed but versions 4-10 aren't, everything below 3 can be cleaned up, but nothing from 3 onward can be.
Third, if a session is closed, any outstanding items are cleaned up, whether that is freeing memory, killing transactions, etc. No evidence was given that the session is actually disconnected on your secondary replica.
I wrote this in lieu of adding comments. If the OP decides to add more data from the secondary, I'll address it. The replica can also be changed to non-readable, which will solve the problem, since the issue is queries on the secondary.
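If going the non-readable route, the change is a one-liner per replica; a sketch with placeholder names ([YourAG] and 'YourSecondaryServer' are, of course, to be replaced with your own):

-- Disable read access on the secondary role so no snapshot transactions
-- (and therefore no version store usage) can originate on the replica.
ALTER AVAILABILITY GROUP [YourAG]
MODIFY REPLICA ON 'YourSecondaryServer'
WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO));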
Whenever a user clicks the GetReport button, a request is sent to the server, where the SQL is built in the back end and a connection is established with the database. When ExecuteReader() is executed, it returns data with varying response times.
There are 12 servers in the production environment, and they are configured so that when there is no response from the back end for more than 60 seconds, the connection is removed and a blank screen appears in the UI.
In my code, the SQL is built and the connection is established, but when ExecuteReader() returns data after more than 60 seconds, the server setting removes the connection, leading to the blank screen.
If ExecuteReader() returns data within 60 seconds, the functionality works fine. The problem occurs only when it does not retrieve data within 60 seconds.
The problem is that, for the same SQL, ExecuteReader() sometimes returns data within 2 seconds and sometimes takes 2 minutes.
Please suggest why the same query returns data at such different time intervals, and how I should proceed in this situation, as we are not able to increase the response time in production because of security issues.
The code is in VB.NET.
You said it yourself:
how I should proceed in this situation, as we are not able to increase the response time in production because of security issues.
There's nothing you can do.
If, however, you do suddenly gain the permissions to modify the query that is being run, or to reconfigure the resource provision of the production system, post back here with a screenshot of the execution plan and we can point out any potential performance bottlenecks.
Dan's comment pretty much covers why a database query might be slow; it's usually a similar reason to why YouTube is slower to buffer at 7pm: the parents got home from work at 6, the kids screamed at them for an hour wanting to go on YouTube while the parents desperately tried to engage them in something more educational or physically active, and the parents finally gave in, wanting some peace and quiet :) - resource provision/supply and demand in the entire chain between you and YouTube.
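One cheap check for the 2-seconds-versus-2-minutes pattern, which often points to parameter sniffing or a plan change, is to compare the best and worst runtimes of the cached statement; a sketch against the standard query-stats DMV:

-- Statements whose best and worst elapsed times diverge the most.
-- A huge gap for the same statement suggests plan or parameter issues.
SELECT TOP (20)
       qs.execution_count,
       qs.min_elapsed_time / 1000 AS min_elapsed_ms,
       qs.max_elapsed_time / 1000 AS max_elapsed_ms,
       SUBSTRING(t.text, (qs.statement_start_offset / 2) + 1, 200) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
ORDER BY qs.max_elapsed_time - qs.min_elapsed_time DESC;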
We have an e-commerce application running on MS SQL.
Every now and then we have a flash sale, and once we start inserting all the orders into the database, our site's performance drops. We have it at the point where we can insert about 1,500 orders in a minute, but the site hangs for a few minutes after that. The site only hangs once the inserts start happening.
I have been looking into using Azure Service Bus queues mixed with SignalR to manage the order process, as this was suggested to me a while back. The way I see it happening is (broad overview):
1. The client calls a procedure on the server, which inserts the order into a queue.
2. The client is notified that they are in a queue.
3. A worker process picks the order up from the queue and inserts it into the database.
4. The server notifies the client that the order is processed and moves them on to the payment page.
I am new to SignalR and queues in general so my questions are:
Will queues actually have a performance benefit? If so, why?
Are queues even the correct thing to use in this instance?
The overview you mention makes sense. It seems like you should be able to do it without SignalR, since Service Bus will let you know once it has successfully inserted the message into the queue.
It is not that queues give you better performance for a single request. Messages placed onto the queue are stored until you are ready to process them. By doing this you will not suffer "peak" issues, and you will be able to receive from the queue at a rate you know your system can sustain (maybe 500 orders/minute, or whatever number works for you).
So they will give you a much more stable latency per request without bringing down your system.
I'm trying to troubleshoot some intermittent slowdowns in our application. I've got a separate question here with more details.
I ran sp_who2 and noticed a few connections that have a status of SUSPENDED and high DiskIO. Can someone explain to me what that indicates?
This is a very broad question, so I am going to give a broad answer.
A query gets suspended when it requests access to a resource that is currently not available. This can be a logical resource, like a locked row, or a physical resource, like a memory data page. The query starts running again once the resource becomes available.
High disk IO means that a lot of data pages need to be accessed to fulfill the request.
That is all I can tell from the above screenshot. However, if I were to speculate, you probably have an IO subsystem that is too slow to keep up with the demand. This could be caused by missing indexes or by a disk that really is too slow. Keep in mind that 15,000 reads for a single OLTP query is slightly high but not uncommon.
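To see what a suspended session is actually waiting on at any moment, something like this against the requests DMV is usually enough (a sketch, not specific to your screenshot):

-- Every currently suspended request, the wait it is stuck on, and its IO so far.
SELECT session_id,
       wait_type,
       wait_time AS wait_time_ms,
       last_wait_type,
       blocking_session_id,
       reads, writes, logical_reads
FROM sys.dm_exec_requests
WHERE status = 'suspended';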
Suspended. The session is waiting for an event, such as I/O, to complete.
http://msdn.microsoft.com/en-us/library/ms174313.aspx
Run sp_who2 to find the suspended SPIDs.
Then right-click on the server name and open "Activity Monitor".
In Activity Monitor, in the Processes section, look for those SPIDs in the "Blocked By" column.
That will tell you which process is preventing your suspended process from running.
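The same "Blocked By" information is also available without the GUI; a minimal script equivalent:

-- Script equivalent of Activity Monitor's "Blocked By" column.
SELECT r.session_id AS suspended_spid,
       r.blocking_session_id AS blocked_by,
       r.wait_type,
       r.wait_time AS wait_time_ms
FROM sys.dm_exec_requests AS r
WHERE r.status = 'suspended'
  AND r.blocking_session_id <> 0;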