SQL Server :: Replication Distribution Agent never ending - sql-server

I'm running a SQL Server 2019 Always On availability group with asynchronous replication.
I use a free tool called IDERA SQL Check and I have spotted SPID 69, whose program name is Replication Distribution Agent. It's always there, staring at me like a bored cat.
SPID 69 points to a specific database, which is mirrored. I investigated it with this query:
select
s.session_id
,st.text
,login_name
,login_time
,host_name
,program_name
,status
,cpu_time
,memory_usage
,total_scheduled_time
,total_elapsed_time
,last_request_start_time
,reads
,writes
,logical_reads
from sys.dm_exec_sessions s
inner join sys.dm_exec_connections c
on s.session_id = c.session_id
outer apply sys.dm_exec_sql_text(c.most_recent_sql_handle) st
where s.is_user_process = 1
and s.open_transaction_count > 0;
Which gave me this response:
session_id = 69
text = begin tran
login_time = 2020-09-08 18:40:57.153
program_name = Replication Distribution Agent
status = sleeping
cpu_time = 1362772
memory_usage = 4
total_scheduled_time = 1689634
total_elapsed_time = 22354857
last_request_start_time = 2020-09-28 16:28:39.433
reads = 18607577
writes = 5166597
logical_reads = 112256365
Everything I find online says that seeing the Replication Distribution Agent is fine: the agent is supposed to keep running and there should be no problem. But I still have questions:
Why does the text say begin tran and nothing more?
Why is IDERA SQL Check labelling it as a connection idling in a transaction?
Why is the status sleeping?
I'm concerned that the CPU time, reads and writes are telling me that this process is frying the drive with never-ending I/O. Am I right?

This is perfectly normal.
The Replication Distribution Agent runs continuously: it has to keep picking up the transactions captured from your source and forwarding them to the replicas, so its session never goes away.
It is not frying your drive, unless your transaction rate is so high that the replicated workload itself is doing that. The high read count is a cumulative value since the session logged in, not a snapshot of current activity. At 8 KB per read, 18,607,577 reads is the equivalent of about 141 GB over 20 days, which is not particularly heavy use.
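The 141 GB figure can be reproduced from the session counters in the question. This is a minimal sketch in Python, assuming each counted read or write moves one 8 KB page (an assumption: the DMV only reports a count of operations) and that the counters have been accumulating since login_time:

```python
from datetime import datetime

PAGE_BYTES = 8 * 1024  # assumption: one 8 KB page per counted read/write

# Values copied from the session output in the question.
reads = 18_607_577
writes = 5_166_597
login_time = datetime(2020, 9, 8, 18, 40, 57)
last_request = datetime(2020, 9, 28, 16, 28, 39)


def to_gib(operations: int) -> float:
    """Convert a count of page-sized I/O operations to GiB."""
    return operations * PAGE_BYTES / 1024**3


elapsed_days = (last_request - login_time).total_seconds() / 86_400
read_gib = to_gib(reads)
write_gib = to_gib(writes)

print(f"session age: {elapsed_days:.1f} days")
print(f"reads:  {read_gib:.1f} GiB total, {read_gib / elapsed_days:.1f} GiB/day")
print(f"writes: {write_gib:.1f} GiB total, {write_gib / elapsed_days:.1f} GiB/day")
```

Under those assumptions the reads come to roughly 142 GiB spread over just under 20 days, which is on the order of 7 GiB per day.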

Related

How to perform XA transactions using SQLServer client ODBC driver in C/C++ on Linux?

I've managed to perform XA transactions using SQLServer OLEDB driver on Windows. Now I've ported the C++ application to Linux. On Linux Microsoft provides the SQLServer 2019 ODBC driver and since version 17.3 of this driver, XA is reported to be supported.
Microsoft provides the following example that illustrates how to implement the xa_* functions:
Using XA Transactions
The example itself works. Using the code in another context doesn't work. The call to SQLSetConnectAttr(..., SQL_ATTR_ENLIST_IN_XA, ...) for operation OP_START fails and I don't get useful information by CheckRC().
How can I get more information about a failing SQL_ATTR_ENLIST_IN_XA call?
How does XA work with the SQL_ATTR_ENLIST_IN_XA approach compared to OLEDB?
Is it possible to change the isolation level in XA mode?
Share your experiences and details with us, please.
Strict XID data layout
The SQLSetConnectAttr(..., SQL_ATTR_ENLIST_IN_XA, ...) function is very sensitive regarding the XID. If the XID has a branch ID, then the branch ID must start at byte 64 of xid_t::data. Storing a global ID like "f9707929-a367-4e3a-9a80-3fbb3a23ab11" and the branch ID "1234" directly in one sequence, and identifying the string layout by xid_t::gtrid_length and xid_t::bqual_length, works with other DB APIs and IBM MQ, but it fails with SQL_ATTR_ENLIST_IN_XA in SQLServer.
To get the above sample XID to work, the UUID has to be stored at the beginning of xid_t::data (bytes 0-35) and the branch ID has to be stored starting at byte 64 (bytes 64-67). The xid_t field gtrid_length has to be set to 36 and bqual_length to 4. I set formatID to 1.
If the XID layout doesn't fit, SQL_ATTR_ENLIST_IN_XA with operation OP_START fails and SQLGetDiagRec() reports nothing about it.
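The layout described above can be sketched as a byte buffer. This is an illustration in Python rather than the C struct the driver actually receives; the helper name build_xid_data is hypothetical, the 128-byte size is XIDDATASIZE from the X/Open XA specification, and the offset of 64 is the behavior reported in this answer, not something taken from the driver documentation:

```python
XIDDATASIZE = 128   # size of xid_t::data in the X/Open XA specification
BQUAL_OFFSET = 64   # offset this answer found SQL_ATTR_ENLIST_IN_XA to require


def build_xid_data(gtrid: bytes, bqual: bytes) -> bytes:
    """Place gtrid at byte 0 and bqual at byte 64, zero-padding the rest."""
    if len(gtrid) > BQUAL_OFFSET or len(bqual) > XIDDATASIZE - BQUAL_OFFSET:
        raise ValueError("gtrid and bqual must each fit in 64 bytes")
    data = bytearray(XIDDATASIZE)          # all zeros
    data[0:len(gtrid)] = gtrid
    data[BQUAL_OFFSET:BQUAL_OFFSET + len(bqual)] = bqual
    return bytes(data)


gtrid = b"f9707929-a367-4e3a-9a80-3fbb3a23ab11"
bqual = b"1234"
data = build_xid_data(gtrid, bqual)

# These are the values that go into the other xid_t fields.
format_id = 1
gtrid_length = len(gtrid)
bqual_length = len(bqual)

print(gtrid_length, bqual_length)
print(data[0:gtrid_length])
print(data[BQUAL_OFFSET:BQUAL_OFFSET + bqual_length])
```

The contrast with the packed layout is the gap: bytes 36-63 stay zero here, whereas the packed form accepted by other resource managers would put "1234" directly at byte 36.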
Setting the isolation level
By default an XA transaction runs under isolation level "Serializable". Microsoft describes this isolation level as follows:
The highest level where transactions are completely isolated from one another. The SQLServer keeps read and write locks acquired on selected data to be released at the end of the transaction. Range-locks are acquired when a SELECT operation uses a ranged WHERE clause, especially to avoid phantom reads.
On each call to xa_start the isolation level is set to "Serializable". Setting the isolation level using SQLSetConnectAttr(..., SQL_ATTR_TXN_ISOLATION, ...) after connect doesn't help. You have to call this after SQLSetConnectAttr(..., SQL_ATTR_ENLIST_IN_XA, OP_START, ...).
Doing so allows you to set the isolation level to, for instance, SQL_TXN_READ_COMMITTED. The database option READ_COMMITTED_SNAPSHOT is also honored: with isolation level SQL_TXN_READ_COMMITTED and READ_COMMITTED_SNAPSHOT enabled on the database, reads use row versioning (read committed snapshot) instead of shared locks.
The command "DBCC USEROPTIONS" can be used to query the isolation level of the current session.
The following query is also useful for checking the isolation level and status of active transactions:
SELECT tst.session_id, [database_name] = db_name(s.database_id)
, tat.transaction_begin_time
, transaction_duration_s = datediff(s, tat.transaction_begin_time, sysdatetime())
, transaction_type = CASE tat.transaction_type WHEN 1 THEN 'Read/write transaction'
WHEN 2 THEN 'Read-only transaction'
WHEN 3 THEN 'System transaction'
WHEN 4 THEN 'Distributed transaction' END
, input_buffer = ib.event_info, tat.transaction_uow
, transaction_state = CASE tat.transaction_state
WHEN 0 THEN 'The transaction has not been completely initialized yet.'
WHEN 1 THEN 'The transaction has been initialized but has not started.'
WHEN 2 THEN 'The transaction is active - has not been committed or rolled back.'
WHEN 3 THEN 'The transaction has ended. This is used for read-only transactions.'
WHEN 4 THEN 'The commit process has been initiated on the distributed transaction.'
WHEN 5 THEN 'The transaction is in a prepared state and waiting resolution.'
WHEN 6 THEN 'The transaction has been committed.'
WHEN 7 THEN 'The transaction is being rolled back.'
WHEN 8 THEN 'The transaction has been rolled back.' END
, trn_iso_level = CASE s.transaction_isolation_level
WHEN 0 THEN 'Unspecified'
WHEN 1 THEN 'ReadUncommitted'
WHEN 2 THEN 'ReadCommitted'
WHEN 3 THEN 'RepeatableRead'
WHEN 4 THEN 'Serializable'
WHEN 5 THEN 'Snapshot' END
, transaction_name = tat.name, request_status = r.status
, tst.is_user_transaction, tst.is_local
, session_open_transaction_count = tst.open_transaction_count
, s.host_name, s.program_name, s.client_interface_name, s.login_name, s.is_user_process
FROM sys.dm_tran_active_transactions tat
INNER JOIN sys.dm_tran_session_transactions tst on tat.transaction_id = tst.transaction_id
INNER JOIN sys.dm_exec_sessions s on s.session_id = tst.session_id
LEFT OUTER JOIN sys.dm_exec_requests r on r.session_id = s.session_id
CROSS APPLY sys.dm_exec_input_buffer(s.session_id, null) AS ib;
The Advantage of SQLServer ODBC driver SQL_ATTR_ENLIST_IN_XA
Implementing SQLServer XA with the OLEDB driver and the ITransactionJoin interface communicates directly with the local Distributed Transaction Coordinator (DTC) service. If SQLServer is running on another host, then both the local DTC and the DTC on the SQLServer host are involved, and the two DTC services must communicate over the network. RPC, dynamic port ranges, firewalls and security settings often make this very difficult to get working.
With the new ODBC SQL_ATTR_ENLIST_IN_XA interface the DTC-to-DTC communication is no longer needed. The application only has a connection to the SQLServer database instance; the DTC service must run on the SQLServer host, and the "XA option" must be set in that DTC. The application that uses SQL_ATTR_ENLIST_IN_XA doesn't require a local DTC.

What accounts for different execution times between HeidiSQL and SSMS?

When I execute a particular query from Heidi against an MSSQL database, it takes approximately 10 times longer than executing the identical query in SSMS.
They are both being executed against the same server from the same workstation.
What can account for this difference?
Here is the exact query and relative execution times:
SELECT b.ID as BookingID, b.ReservationID, b.RoomID, b.EventName,
b.EventTypeID, b.StatusID, b.DateAdded, build.Value1 as BuildingID,
build.ValueDescription as Building
FROM EMS.dbo.tblBooking b
INNER JOIN EMS.dbo.tblRoom room
ON room.ID = b.RoomID
INNER JOIN ( SELECT deff.Value1, deff.ValueDescription
FROM tblDataExtractionFilter_Fields deff
INNER JOIN tblDataExtractionFilter def ON deff.FilterID = def.ID
WHERE def.Description = '[redacted]'
AND deff.FieldID = 28
AND deff.Show = 0) build
ON room.BuildingID = build.Value1
WHERE b.DateAdded > DATEADD(DAY,-7,GETDATE()) AND (StatusID = 1 OR StatusID = 16);
Heidi: "Duration for 1 query: 1.015 sec."
SSMS: "00:00:01"
I am obviously green, but my understanding was that the execution plan was determined server side and not application side. This leads me to suspect that there is some sort of overhead in Heidi with respect to this query (simpler queries execute MUCH faster so the overhead would not be universal).
This is just a point of curiosity for me. I am still learning. Can anyone offer a clue about what I can check/google/research to try to understand this?
Thanks!
EDIT: The times I have reported do not agree with my statement that the SSMS time is 1/10 that of Heidi. They are both approximately 1 second. My subjective wait time (wall clock time) between execution and display is MUCH faster (and much less than 1 second) in SSMS. Can this be due to SSMS caching the results?

QDS not showing anything while DTU is maxed out

I'm trying to identify which query is causing my workload to stall. According to the metrics (Metrics (preview) tab in the Azure Portal) I see 100% DTU utilization, caused by the CPU.
But when I go to QDS I see a different picture:
The queries reported by QDS for this period take nowhere near as long as the time during which the DTU cap is being hit.
I know that the 1 minute reported by the metrics view is the correct one, since the operation from the user side takes that long and I can see in the Web App telemetry the app not responding in this time period.
So how can I identify the query that hits the DTU limit?
P.S. The db is an S0.
UPDATE
@Alberto Morillo, I've executed the query. A lot of cheap queries ran (~2k); the largest values for total_worker_time are around 54k (54 ms). On the other hand, the wait stats are dominated by SOS_WORK_DISPATCHER.
Does this mean that the queries are blocking because the scheduler can't spawn workers fast enough?
Please run the following query:
SELECT TOP 10 q.query_id, p.plan_id,
rs.count_executions,
qsqt.query_sql_text,
CONVERT(NUMERIC(10,2), (rs.avg_cpu_time/1000000.0)) as 'avg_cpu_time_seconds',
CONVERT(NUMERIC(10,2),(rs.avg_duration/1000000.0)) as 'avg_duration_seconds',
CONVERT(NUMERIC(10,2),rs.avg_logical_io_reads ) as 'avg_logical_io_reads',
CONVERT(NUMERIC(10,2),rs.avg_logical_io_writes ) as 'avg_logical_io_writes',
CONVERT(NUMERIC(10,2),rs.avg_physical_io_reads ) as 'avg_physical_io_reads',
CONVERT(NUMERIC(10,0),rs.avg_rowcount ) as 'avg_rowcount'
from sys.query_store_query q
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
INNER JOIN sys.query_store_query_text qsqt
ON q.query_text_id = qsqt.query_text_id
WHERE rs.last_execution_time > dateadd(hour, -1, getutcdate())
ORDER BY rs.avg_duration DESC
Then change the ORDER BY clause to avg_cpu_time, and to avg_rowcount, and run it again for each.
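Query Store's avg_cpu_time and avg_duration, like the total_worker_time mentioned in the update above, are reported in microseconds, so it is easy to misread them by a factor of 1000. This is a small Python sketch of the conversions; the helper names are hypothetical, and the 54k and ~2k figures are the ones quoted in the update:

```python
US_PER_MS = 1_000
US_PER_S = 1_000_000


def us_to_ms(microseconds: float) -> float:
    """Microseconds to milliseconds."""
    return microseconds / US_PER_MS


def us_to_s(microseconds: float) -> float:
    """Microseconds to seconds."""
    return microseconds / US_PER_S


worker_time_us = 54_000   # largest total_worker_time quoted in the update
query_count = 2_000       # "~2k" cheap queries, from the same update

print(us_to_ms(worker_time_us), "ms")
print(us_to_s(worker_time_us), "s")

# Upper bound on CPU time if every query cost as much as the most expensive one.
cpu_upper_bound_s = us_to_s(worker_time_us * query_count)
print(cpu_upper_bound_s, "s of CPU at most")
```

The last figure is only a ceiling under the stated assumption, and it says nothing about how the work was distributed over the minute in which the DTU cap was hit.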

How to find replication lag in SQL Server?

In MySQL, the value of "Seconds_Behind_Master" shows how far a slave server is lagging behind its master. Is there something similar in MSSQL that shows how far a slave server is lagging behind its master?
There is some controversy on this subject, but I like to use regularly posted tracer tokens. That is, you call the sp_posttracertoken procedure on the publisher and it will send a, well, token all the way through to the subscriber. You can see the history of all tokens in the distributor database. I wrote the following view to make the data a little easier to grok:
create view [dbo].[tokens] as
select
ps.name as [publisher],
p.publisher_db,
p.publication,
ss.name as [subscriber],
da.subscriber_db,
t.publisher_commit,
t.distributor_commit,
h.subscriber_commit,
datediff(second, t.publisher_commit, t.distributor_commit) as [pub to dist (s)],
datediff(second, t.distributor_commit ,h.subscriber_commit) as [dist to sub (s)],
datediff(second, t.publisher_commit, h.subscriber_commit) as [total latency (s)]
from MStracer_tokens t
inner join MStracer_history h
on t.tracer_id = h.parent_tracer_id
inner join MSpublications p
on p.publication_id = t.publication_id
inner join sys.servers ps
on p.publisher_id = ps.server_id
inner join MSdistribution_agents da
on h.agent_id = da.id
inner join sys.servers ss
on da.subscriber_id = ss.server_id
Another approach is to use what's commonly called a canary table. The idea is that you have a table specifically to monitor replication that typically only has one row with a datetime field. You update the column at the publisher and then you monitor how far behind the subscriber is by seeing what the value of that column is at the subscriber.
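The canary-table calculation itself is just a subtraction. This is a minimal Python sketch with hypothetical timestamps and a hypothetical helper name; it assumes both values are in UTC so that time zone differences between servers do not show up as lag:

```python
from datetime import datetime, timezone


def canary_lag_seconds(canary_value: datetime, now: datetime) -> float:
    """Seconds elapsed since the newest canary timestamp seen at the subscriber."""
    return max(0.0, (now - canary_value).total_seconds())


# Hypothetical values: the canary row as read at the subscriber, and the
# clock at the moment of the check.
canary_at_subscriber = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
checked_at = datetime(2024, 1, 1, 12, 0, 42, tzinfo=timezone.utc)

lag = canary_lag_seconds(canary_at_subscriber, checked_at)
print(f"subscriber is about {lag:.0f}s behind")
```

The result can never be more precise than the interval at which the publisher updates the canary row, so a row updated once a minute will report up to a minute of lag even when replication is fully caught up.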
Lastly, there are some perfmon counters that you can look at. In my experience, they're not that great; the number of outstanding commands is an accurate number, but the measurement of latency as a duration is typically very inaccurate.

close/kill transaction

I have this open transaction, according to DBCC OPENTRAN:
Oldest active transaction:
SPID (server process ID) : 54
UID (user ID) : -1
Name : UPDATE
LSN : (4196:12146:1)
Start time : Jul 20 2011 12:44:23:590PM
SID : 0x01
Is there a way to kill it or roll it back?
You should first figure out what it was doing, where it came from, and if applicable how much longer it might be expected to run:
SELECT
r.[session_id],
c.[client_net_address],
s.[host_name],
c.[connect_time],
[request_start_time] = s.[last_request_start_time],
[current_time] = CURRENT_TIMESTAMP,
r.[percent_complete],
[estimated_finish_time] = DATEADD
(
MILLISECOND,
r.[estimated_completion_time],
CURRENT_TIMESTAMP
),
current_command = SUBSTRING
(
t.[text],
r.[statement_start_offset]/2 + 1,
COALESCE((NULLIF(r.[statement_end_offset], -1) - r.[statement_start_offset])/2 + 1, 2147483647)
),
module = COALESCE(QUOTENAME(OBJECT_SCHEMA_NAME(t.[objectid], t.[dbid]))
+ '.' + QUOTENAME(OBJECT_NAME(t.[objectid], t.[dbid])), '<ad hoc>'),
[status] = UPPER(s.[status])
FROM
sys.dm_exec_connections AS c
INNER JOIN
sys.dm_exec_sessions AS s
ON c.session_id = s.session_id
LEFT OUTER JOIN
sys.dm_exec_requests AS r
ON r.[session_id] = s.[session_id]
OUTER APPLY
sys.dm_exec_sql_text(r.[sql_handle]) AS t
WHERE
c.session_id = 54;
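The statement offsets used for current_command are byte offsets into the batch text, which is stored as two-byte Unicode characters, and an end offset of -1 means "to the end of the batch". The Python sketch below mimics that extraction; the function name and the sample offsets are hypothetical, and it assumes the end offset is the byte position of the last character of the statement, which is what the commonly documented (end - start)/2 + 1 length formula implies:

```python
def current_statement(batch_text: str, start_offset: int, end_offset: int) -> str:
    """Cut one statement out of a batch using byte offsets into UTF-16 text."""
    raw = batch_text.encode("utf-16-le")   # 2 bytes per UTF-16 code unit
    # -1 means the statement runs to the end of the batch; otherwise the end
    # offset is treated as inclusive, so add one character (2 bytes).
    stop = len(raw) if end_offset == -1 else end_offset + 2
    return raw[start_offset:stop].decode("utf-16-le")


batch = "SELECT 1; SELECT 2;"

# Hypothetical offsets for the two statements in the batch above.
print(current_statement(batch, 0, 16))
print(current_statement(batch, 20, 36))
print(current_statement(batch, 20, -1))
```

This is why the T-SQL version divides both offsets by 2: SUBSTRING counts characters, while the DMV counts bytes.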
If you are confident that you can sever this connection you can use:
KILL 54;
Just be aware that depending on what the session was doing it could leave data and/or the app that called it in a weird state.
When you suspect blocking caused by a long-running open transaction, run the following command at regular intervals.
DBCC opentran()
If the same SPID keeps coming back in a report like the following
Oldest active transaction:
SPID (server process ID): 131
UID (user ID) : -1
Name : implicit_transaction
LSN : (634998:226913:1)
Start time : Jan 19 2022 6:36:36:360PM
SID : 0x010500000000000515000000c6bb507a9dbeda5275b975547b3e0000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
Completion time: 2022-01-19T18:36:38.8421769+03:00
then query the details of that session. It is critical to find and permanently fix the source of the problem.
exec sp_who2 131
exec sp_lock 131
After investigating the cause, you can release the blocking by killing that session.
KILL 131
If you want to see, as a table, all SPIDs with open transactions and the session that is blocking each of them, use the following query.
SELECT spid, blocked,[dbid],last_batch,open_tran
FROM master.sys.sysprocesses
WHERE open_tran <> 0
I ran into a situation where a session was locked up, as reported by DBCC OPENTRAN, but because of the corporate lockdown of the server/database I was not able to use KILL.
I discovered that the app I was using to execute the script(s), VS 2022, was complicit, so to speak, in keeping the transactions alive. By closing the app, it notified me that there were active sessions running and that closing could have consequences. By accepting the notifications and closing the app, the open transactions would subsequently be closed.
