Azure SQL frequent connection timeouts

We're running a web app (2 instances) on Azure, backed by a SQL Azure database. At any given time there are 50-150 users using the website. The database runs at S2 performance level. The DTU is around 20% on average.
However, a few times every day I suddenly get hundreds of errors in my logs with timeouts, like this:
An error occurred while executing the command definition. See the inner exception for details.
The wait operation timed out.
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. This failure occurred while attempting to connect to the routing destination. The duration spent while attempting to connect to the original server was - [Pre-Login] initialization=1; handshake=21; [Login] initialization=0; authentication=0; [Post-Login] complete=1;
We're using EF6 for queries with the default command timeout. I've configured this execution strategy:
SetExecutionStrategy("System.Data.SqlClient",
() => new SqlAzureExecutionStrategy(10, TimeSpan.FromSeconds(15)));
The database (about 15GB total) is heavily indexed. These errors occur all over the place, usually dozens to hundreds within 1-2 minutes.
What steps can I take to prevent this from happening?

The fact that it happens within 1-2 minutes might point to a burst in activity or some process that is locking up tables.
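If table locking is the suspect, a quick way to confirm it while the errors are firing is to look for blocked sessions; a minimal sketch using sys.dm_exec_requests:
-- Sessions that are currently blocked, and who is blocking them
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,   -- milliseconds
       r.command
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0;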
If your DTU during those times is at 20%, it is not a CPU issue, but you can always find the bottlenecks by running this query on the DB:
SELECT TOP 10
    total_worker_time / execution_count AS Avg_CPU_Time
    ,execution_count
    ,total_elapsed_time / execution_count AS AVG_Run_Time
    ,(SELECT SUBSTRING(text,
            -- /2 + 1 converts the byte offset into a 1-based character position
            statement_start_offset / 2 + 1,
            (CASE
                WHEN statement_end_offset = -1 THEN LEN(CONVERT(nvarchar(max), text)) * 2
                ELSE statement_end_offset
             END - statement_start_offset) / 2)
      FROM sys.dm_exec_sql_text(sql_handle)
     ) AS query_text
FROM sys.dm_exec_query_stats
ORDER BY Avg_CPU_Time DESC
Even if the DB is heavily indexed, indexes get fragmented. I'd advise running this to check the current fragmentation:
select a.*, b.AverageFragmentation
from
(   SELECT tbl.name AS [Table_Name], tbl.object_id, i.name AS [Name], i.index_id,
        CAST(CASE i.index_id WHEN 1 THEN 1 ELSE 0 END AS bit) AS [IsClustered],
        CAST(CASE WHEN i.type = 3 THEN 1 ELSE 0 END AS bit) AS [IsXmlIndex],
        CAST(CASE WHEN i.type = 4 THEN 1 ELSE 0 END AS bit) AS [IsSpatialIndex]
    FROM sys.tables AS tbl
    INNER JOIN sys.indexes AS i
        ON (i.index_id > 0 AND i.is_hypothetical = 0) AND (i.object_id = tbl.object_id)
) a
inner join
(   SELECT tbl.object_id, i.index_id, fi.avg_fragmentation_in_percent AS [AverageFragmentation]
    FROM sys.tables AS tbl
    INNER JOIN sys.indexes AS i
        ON (i.index_id > 0 AND i.is_hypothetical = 0) AND (i.object_id = tbl.object_id)
    INNER JOIN sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS fi
        ON fi.object_id = CAST(i.object_id AS int) AND fi.index_id = CAST(i.index_id AS int)
) b
    on a.object_id = b.object_id and a.index_id = b.index_id
order by AverageFragmentation desc
You can also use Azure Automation to schedule automatic rebuilding of fragmented indexes; see the answer at: Why my Azure SQL Database indexes are still fragmented?
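If you would rather run the maintenance manually first, here is a minimal sketch (dbo.YourTable is a placeholder; the 5-30% reorganize / over-30% rebuild thresholds are the commonly cited guidance, not a hard rule):
-- Reorganize moderately fragmented indexes (roughly 5-30% fragmentation)
ALTER INDEX ALL ON dbo.YourTable REORGANIZE;

-- Rebuild heavily fragmented indexes (roughly over 30%);
-- ONLINE = ON is supported on Azure SQL Database and avoids long blocking
ALTER INDEX ALL ON dbo.YourTable REBUILD WITH (ONLINE = ON);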

Related

SymmetricDS Tables not syncing

I took over a project that uses SymmetricDS to sync tables between a source and a target. At the moment, the tables from the source aren't syncing to the target.
I have searched online but found no help.
I have checked the sym_outgoing_batch and sym_incoming_batch tables but can't figure out how to use the information there.
I also queried the sync_trigger table. The result of the query is at the link below.
If you have an idea of where I could look, please let me know. I can run queries and share the results: sync_trigger result
Uncomment the "--where status != 'OK'" lines to see if anything is NOT in the OK state; if something is, that is what is causing the sync to stop.
-- SQL QUERY
-- Symmetric DS : MONITOR : HEARTBEAT / INCOMING / OUTGOING / MONITOREVENTS
SELECT node_id, host_name, getdate() as dtNOW ,heartbeat_time FROM [tablename].[dbo].[sd_node_host] with (NOLOCK) where heartbeat_time > '2022-01-01'
--SELECT * FROM [tablename].[dbo].[sd_context] with (NOLOCK)
SELECT * FROM [tablename].[dbo].[sd_outgoing_batch] with (NOLOCK)
--where status != 'OK'
order by create_time desc
SELECT * FROM [tablename].[dbo].[sd_incoming_batch] with (NOLOCK)
--where status != 'OK'
order by create_time desc
-- Symmetric DS : MONITOR
SELECT * FROM [tablename].[dbo].[sd_monitor_event] with (NOLOCK) WHERE is_resolved != 1
SELECT * FROM [tablename].[dbo].[sd_monitor_event] with (NOLOCK)
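To get a quick overview instead of scanning full rows, you can also summarize the batches by status; a sketch against the same table (status and channel_id are standard SymmetricDS batch columns):
-- Count outgoing batches by status and channel; anything piling up in
-- ER (error) or staying NE (new) points at where the sync is stalled
SELECT status, channel_id, COUNT(*) AS batch_count
FROM [tablename].[dbo].[sd_outgoing_batch] with (NOLOCK)
GROUP BY status, channel_id
ORDER BY batch_count DESC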
EDIT
Here is the link to the SymmetricDS user guide:
https://www.symmetricds.org/doc/3.13/html/user-guide.html#_outgoing_batch
Basically, NE means the batch is ready for replication. Did you ever have this set up, or is this a new setup that you are trying to get started?
EDIT 2 ENGINE CONFIGS
MAIN
engine.name=<SDS_MAIN>
db.driver=net.sourceforge.jtds.jdbc.Driver
db.url=jdbc:jtds:sqlserver://<IP>:1433/<DB>;useCursors=true;bufferMaxMemory=10240;lobBuffer=5242880
db.user=***********
db.password=***********
registration.url=
sync.url=http://<IP>:<PORT>/sync/<SDS_MAIN>
group.id=<GID>
external.id=000
auto.registration=true
initial.load.create.first=true
sync.table.prefix=sym
#start.initial.load.extract.job=false
compression.level=-1
compression.strategy=0
CHILD
engine.name=<SDS_CHILD>
db.driver=net.sourceforge.jtds.jdbc.Driver
db.url=jdbc:jtds:sqlserver://<IP>:1433/<DB>;useCursors=true;bufferMaxMemory=10240;lobBuffer=5242880
db.user=***********
db.password=***********
registration.url=http://<IP>:<PORT>/sync/<SDS_MAIN>
sync.url=http://<IP>:<PORT>/sync/<SDS_CHILD>
group.id=<GID>
external.id=100
auto.registration=true
initial.load.create.first=true
sync.table.prefix=sym
start.initial.load.extract.job=false
compression.level=-1
compression.strategy=0

Azure SQL: Can tempdb size cause a large query to go exceed its size limit?

I have two BI reporting servers (one Development, the other Production), each with its own Azure SQL instance. I am using Pentaho to automate the ETL process, and part of that process is to execute a query joining two tables of similar data in order to match a user's login row with their logout row.
with login_cte as
(
select
os.[k_actions]
,[k_hosts]
,[k_logger_names]
,[k_principals]
,[k_sites]
,[k_roles]
,[k_date] as k_date_logged_in
,[k_time] as k_time_logged_in
,[k_date_ct] as k_date_ct_logged_in
,[k_time_ct] as k_time_ct_logged_in
,[key] as key_logged_in
,[type]
,[id]
,[sort]
,[level]
,[timestamp_utc] as timestamp_utc_logged_in
,[timestamp_ct] as timestamp_ct_logged_in
,[service]
,[version]
,[message] as message_logged_in
,[thread_name]
,[db_name]
,[customer_id]
from other_stage os
left join dim_actions da on os.k_actions = da.k_actions
where [action_name] = 'SET_ROLE_DEVICE_USER'
order by [key] asc
offset 0 rows
),
logout_cte as
(
select
[k_actions]
,[k_hosts] as k_hosts_out
,[k_logger_names] as k_logger_names_out
,[k_principals] as k_principals_out
,[k_sites] as k_sites_out
,null as k_roles_out
,[k_date] as k_date_logged_out
,[k_time] as k_time_logged_out
,[k_date_ct] as k_date_ct_logged_out
,[k_time_ct] as k_time_ct_logged_out
,[key] as key_logged_out
,[type] as type_out
,[id] as id_out
,[sort] as sort_out
,[level] as level_out
,[timestamp_utc] as timestamp_utc_logged_out
,[timestamp_ct] as timestamp_ct_logged_out
,[service] as service_out
,[version] as version_out
,[message] as message_logged_out
,[thread_name] as thread_name_out
,[db_name]
,[customer_id]
from logout_stage
order by [key] asc
offset 0 rows
)
select
li.k_actions
,li.k_hosts
,li.k_logger_names
,li.k_principals
,li.k_roles
,li.k_sites
,li.k_date_logged_in
,li.k_time_logged_in
,lo.k_date_logged_out
,lo.k_time_logged_out
,li.k_date_ct_logged_in
,li.k_time_ct_logged_in
,lo.k_date_ct_logged_out
,lo.k_time_ct_logged_out
,li.key_logged_in
,lo.key_logged_out
,li.type
,li.id
,li.sort
,li.level
,li.timestamp_utc_logged_in
,li.timestamp_ct_logged_in
,lo.timestamp_utc_logged_out
,lo.timestamp_ct_logged_out
,li.service
,li.version
,li.message_logged_in
,lo.message_logged_out
,li.thread_name
,li.db_name
,li.customer_id
,datediff(second, li.timestamp_utc_logged_in, lo.timestamp_utc_logged_out) as login_duration
,row_number() over (partition by li.key_logged_in order by lo.key_logged_out asc) as rn
from login_cte li left join logout_cte lo
on li.k_principals = lo.k_principals_out
and li.db_name = lo.db_name
and lo.key_logged_out > li.key_logged_in
and lo.timestamp_utc_logged_out > li.timestamp_utc_logged_in
order by lo.key_logged_out asc
The issue I'm having is that this query runs just fine by itself on my production instance, but on the development instance, which has fewer rows in each table (~70k vs ~124k), the query fails with this error:
SQLServerException: The database 'tempdb' has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.
Doing some digging into the tempdb size on each I ran this query:
SELECT SUM(unallocated_extent_page_count) AS [free pages],
(SUM(unallocated_extent_page_count)*1.0/128) AS [free space in MB]
FROM tempdb.sys.dm_db_file_space_usage;
With these results:
Production:
free pages: 923,792
free space in MB: 7,217.125
Development:
free pages: 8,386,864
free space in MB: 65,522.375
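To compare the two instances' configured tempdb ceilings directly, you could also check the file sizes and growth caps; a minimal sketch (size columns in sys.database_files are in 8 KB pages; on Azure SQL Database access to tempdb catalog views may be restricted depending on permissions):
-- tempdb file sizes and caps, converted from 8 KB pages to MB
SELECT name,
       size * 8 / 1024 AS size_mb,
       CASE max_size WHEN -1 THEN NULL  -- -1 means unlimited
            ELSE max_size * 8 / 1024 END AS max_size_mb
FROM tempdb.sys.database_files;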
While the query is running, I can run the file space usage query again and watch the free space slowly go down to zero, at which point the query fails. Meanwhile the same query on Production shows the free space go down but never hit zero, and the query succeeds.
I am fairly new to Azure SQL and have tried searching for an answer, but I can only find advice on how to find which query is using up tempdb (which I already know) and nothing on what would cause the same query to fail on one instance and not the other. I do not have access to the Azure portal, but I can connect to the DB as a power user through either DBeaver or SQL Server Management Studio.
Edit 12-14-21:
Paste of the query execution plan

QDS not showing anything while DTU is maxed out

I'm trying to identify which query is causing my workload to stall. According to the metrics (the Metrics (preview) tab in the Azure Portal), I see 100% DTU utilization, caused by the CPU.
But when I go to QDS I see a different picture:
The queries reported by QDS in this period don't take anywhere near as long as the window in which the DTU cap is being hit.
I know that the 1 minute reported by the metrics view is correct, since the operation takes that long from the user's side and I can see in the Web App telemetry that the app is not responding during this time period.
So how can I identify the query that hits the DTU limit?
P.S. The db is an S0.
UPDATE
@Alberto Morillo, I've executed the query. There are a lot of cheap queries being run (~2k executions); the largest values for total_worker_time are around 54k (54 ms). On the other hand, I see the wait stats are dominated by SOS_WORK_DISPATCHER.
Does this mean that the queries are stalling because the scheduler can't spawn workers fast enough?
Please run the following query:
SELECT TOP 10 q.query_id, p.plan_id,
rs.count_executions,
qsqt.query_sql_text,
CONVERT(NUMERIC(10,2), (rs.avg_cpu_time/1000000.0)) as 'avg_cpu_time_seconds', -- avg_cpu_time is reported in microseconds
CONVERT(NUMERIC(10,2), (rs.avg_duration/1000000.0)) as 'avg_duration_seconds', -- avg_duration is reported in microseconds
CONVERT(NUMERIC(10,2),rs.avg_logical_io_reads ) as 'avg_logical_io_reads',
CONVERT(NUMERIC(10,2),rs.avg_logical_io_writes ) as 'avg_logical_io_writes',
CONVERT(NUMERIC(10,2),rs.avg_physical_io_reads ) as 'avg_physical_io_reads',
CONVERT(NUMERIC(10,0),rs.avg_rowcount ) as 'avg_rowcount'
from sys.query_store_query q
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
INNER JOIN sys.query_store_query_text qsqt
ON q.query_text_id = qsqt.query_text_id
WHERE rs.last_execution_time > dateadd(hour, -1, getutcdate())
ORDER BY rs.avg_duration DESC
Also run the query with the ORDER BY clause changed to avg_cpu_time and then to avg_rowcount.
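If the wait stats rather than individual queries look suspicious (as with the SOS_WORK_DISPATCHER observation in the update), Query Store also records wait statistics; a minimal sketch aggregating them by category:
-- Total wait time per Query Store wait category, highest first
SELECT TOP 10
    ws.wait_category_desc,
    SUM(ws.total_query_wait_time_ms) AS total_wait_ms
FROM sys.query_store_wait_stats ws
GROUP BY ws.wait_category_desc
ORDER BY total_wait_ms DESC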

Automatically stop SQL Server Agent when no jobs are running [duplicate]

I have around 40 different SQL Server jobs in one instance. They all have different schedules: some run once a day, some every two minutes, some every five minutes. If I need to stop SQL Server Agent, how can I find the best time, when no jobs are running, so I won't interrupt any of my jobs?
how can I find the best time when no jobs are running so I won't interrupt any of my jobs?
You basically want to find a good window to perform some maintenance. @MaxVernon has blogged about it here with a handy script:
/*
Shows gaps between agent jobs
-- http://www.sqlserver.science/tools/gaps-between-sql-server-agent-jobs/
-- requires SQL Server 2012+ since it uses the LAG analytic function.
Note: On SQL Server 2005, SQL Server 2008, and SQL Server 2008 R2, you could replace the LastEndDateTime column definition with:
LastEndDateTime = (SELECT TOP(1) s1a.EndDateTime FROM s1 s1a WHERE s1a.rn = s1.rn - 1)
*/
DECLARE @EarliestStartDate DATETIME;
DECLARE @LatestStopDate DATETIME;
SET @EarliestStartDate = DATEADD(DAY, -1, GETDATE());
SET @LatestStopDate = GETDATE();
;WITH s AS
(
SELECT StartDateTime = msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time)
, MaxDuration = MAX(sjh.run_duration)
FROM msdb.dbo.sysjobs sj
INNER JOIN msdb.dbo.sysjobhistory sjh ON sj.job_id = sjh.job_id
WHERE sjh.step_id = 0
AND msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time) >= @EarliestStartDate
AND msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time) <= @LatestStopDate
GROUP BY msdb.dbo.agent_datetime(sjh.run_date, sjh.run_time)
UNION ALL
SELECT StartDate = DATEADD(SECOND, -1, @EarliestStartDate)
, MaxDuration = 1
UNION ALL
SELECT StartDate = @LatestStopDate
, MaxDuration = 1
)
, s1 AS
(
SELECT s.StartDateTime
, EndDateTime = DATEADD(SECOND, s.MaxDuration - ((s.MaxDuration / 100) * 100)
+ (((s.MaxDuration - ((s.MaxDuration / 10000) * 10000))
- (s.MaxDuration - ((s.MaxDuration / 100) * 100))) / 100) * 60
+ (((s.MaxDuration - ((s.MaxDuration / 1000000) * 1000000))
- (s.MaxDuration - ((s.MaxDuration / 10000) * 10000))) / 10000) * 3600, s.StartDateTime)
FROM s
)
, s2 AS
(
SELECT s1.StartDateTime
, s1.EndDateTime
, LastEndDateTime = LAG(s1.EndDateTime) OVER (ORDER BY s1.StartDateTime)
FROM s1
)
SELECT GapStart = CONVERT(DATETIME2(0), s2.LastEndDateTime)
, GapEnd = CONVERT(DATETIME2(0), s2.StartDateTime)
, GapLength = CONVERT(TIME(0), DATEADD(SECOND, DATEDIFF(SECOND, s2.LastEndDateTime, s2.StartDateTime), 0))
FROM s2
WHERE s2.StartDateTime > s2.LastEndDateTime
ORDER BY s2.StartDateTime;
The question title scared me a bit - I thought you wanted to programmatically shut the SQL Server agent down anytime there were no jobs running. My answer to that question would be "Why?" There is no need to.
But if you are just looking to do a planned restart or shutdown and you don't have a third-party tool like Sentry One's SQL Sentry Event Manager for visualization, I would just let the SQL Server Agent Job History and Job Activity Monitor help here. The Job Activity Monitor can show you which jobs are running right now in the status column. You can also see the last execute and next execute dates and times.
In the object browser in SSMS, connect to your instance, then expand SQL Server Agent, then you'll see Jobs and under that you'll see "Job Activity Monitor" - this view should show you what you need.
Also - don't worry about shutting down before a job executes. If you do that, the job will simply miss its schedule; you can either let it run when it is next due (depending on the job and its purpose) or manually right-click and execute the job.
For more on the activity monitor for jobs, see Monitor Job Activity in the product documentation.
I recommend creating a script that will disable your jobs. Disabled jobs still exist but will not be automatically launched by their schedules. Run this script (based on the procedure sp_update_job in the msdb database) to disable the jobs, wait for any currently running jobs to finish execution, then stop SQL Agent. A similar script to re-enable the disabled jobs would be useful. You might need to plan around jobs that are, and should remain, disabled.
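A minimal sketch of such a disable script, assuming sufficient rights in msdb (note that a blanket re-enable afterwards would also enable jobs that were meant to stay disabled, so capture the current state first):
-- Disable every currently enabled Agent job so schedules stop firing
DECLARE @job_id UNIQUEIDENTIFIER;
DECLARE job_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT job_id FROM msdb.dbo.sysjobs WHERE enabled = 1;
OPEN job_cursor;
FETCH NEXT FROM job_cursor INTO @job_id;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXEC msdb.dbo.sp_update_job @job_id = @job_id, @enabled = 0;
    FETCH NEXT FROM job_cursor INTO @job_id;
END
CLOSE job_cursor;
DEALLOCATE job_cursor;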
A complete "SQL Agent shutdown" process could be fully scripted, but I question the wisdom of doing so. A bit of research implies that there is no 100% reliable way of programmatically telling whether a given job is running, and while there is an undocumented (where "undocumented" means "you really shouldn't be using this") system procedure for stopping and starting services, doing so from within SQL Server itself seems like a pretty bad idea.
You can query the system tables as shown by Dattatrey Sindol in the MSSQLTips.com article Querying SQL Server Agent Job Information:
SELECT
[sJOB].[job_id] AS [JobID]
, [sJOB].[name] AS [JobName]
, [sDBP].[name] AS [JobOwner]
, [sCAT].[name] AS [JobCategory]
, [sJOB].[description] AS [JobDescription]
, CASE [sJOB].[enabled]
WHEN 1 THEN 'Yes'
WHEN 0 THEN 'No'
END AS [IsEnabled]
, [sJOB].[date_created] AS [JobCreatedOn]
, [sJOB].[date_modified] AS [JobLastModifiedOn]
, [sSVR].[name] AS [OriginatingServerName]
, [sJSTP].[step_id] AS [JobStartStepNo]
, [sJSTP].[step_name] AS [JobStartStepName]
, CASE
WHEN [sSCH].[schedule_uid] IS NULL THEN 'No'
ELSE 'Yes'
END AS [IsScheduled]
, [sSCH].[schedule_uid] AS [JobScheduleID]
, [sSCH].[name] AS [JobScheduleName]
, CASE [sJOB].[delete_level]
WHEN 0 THEN 'Never'
WHEN 1 THEN 'On Success'
WHEN 2 THEN 'On Failure'
WHEN 3 THEN 'On Completion'
END AS [JobDeletionCriterion]
FROM
[msdb].[dbo].[sysjobs] AS [sJOB]
LEFT JOIN [msdb].[sys].[servers] AS [sSVR]
ON [sJOB].[originating_server_id] = [sSVR].[server_id]
LEFT JOIN [msdb].[dbo].[syscategories] AS [sCAT]
ON [sJOB].[category_id] = [sCAT].[category_id]
LEFT JOIN [msdb].[dbo].[sysjobsteps] AS [sJSTP]
ON [sJOB].[job_id] = [sJSTP].[job_id]
AND [sJOB].[start_step_id] = [sJSTP].[step_id]
LEFT JOIN [msdb].[sys].[database_principals] AS [sDBP]
ON [sJOB].[owner_sid] = [sDBP].[sid]
LEFT JOIN [msdb].[dbo].[sysjobschedules] AS [sJOBSCH]
ON [sJOB].[job_id] = [sJOBSCH].[job_id]
LEFT JOIN [msdb].[dbo].[sysschedules] AS [sSCH]
ON [sJOBSCH].[schedule_id] = [sSCH].[schedule_id]
ORDER BY [JobName]

Finding what is writing to the Transaction Log in SQL Server?

Is there a way to see what is writing to the transaction log?
I have a log file that has grown 15 Gigs in the last 20 minutes. Is there a way for me to track down what is causing this?
Activity monitor will show you what is executing.
DBCC OPENTRAN will show the oldest open transaction.
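Both are quick to run from the problem database; DBCC SQLPERF(LOGSPACE) is also handy for watching log usage across all databases:
-- Oldest active transaction in the current database
DBCC OPENTRAN;

-- Log size and percent used for every database on the instance
DBCC SQLPERF(LOGSPACE);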
There is also the dynamic management view sys.dm_tran_active_transactions. For example, here's a query that shows you log file usage by process:
-- This query returns log file space used by all running transactions.
select
SessionTrans.session_id as [SPID],
enlist_count as [Active Requests],
ActiveTrans.transaction_id as [ID],
ActiveTrans.name as [Name],
ActiveTrans.transaction_begin_time as [Start Time],
case transaction_type
when 1 then 'Read/Write'
when 2 then 'Read-Only'
when 3 then 'System'
when 4 then 'Distributed'
else 'Unknown - ' + convert(varchar(20), transaction_type)
end as [Transaction Type],
case transaction_state
when 0 then 'Uninitialized'
when 1 then 'Not Yet Started'
when 2 then 'Active'
when 3 then 'Ended (Read-Only)'
when 4 then 'Committing'
when 5 then 'Prepared'
when 6 then 'Committed'
when 7 then 'Rolling Back'
when 8 then 'Rolled Back'
else 'Unknown - ' + convert(varchar(20), transaction_state)
end as 'State',
case dtc_state
when 0 then NULL
when 1 then 'Active'
when 2 then 'Prepared'
when 3 then 'Committed'
when 4 then 'Aborted'
when 5 then 'Recovered'
else 'Unknown - ' + convert(varchar(20), dtc_state)
end as 'Distributed State',
DB.Name as 'Database',
database_transaction_begin_time as [DB Begin Time],
case database_transaction_type
when 1 then 'Read/Write'
when 2 then 'Read-Only'
when 3 then 'System'
else 'Unknown - ' + convert(varchar(20), database_transaction_type)
end as 'DB Type',
case database_transaction_state
when 1 then 'Uninitialized'
when 3 then 'No Log Records'
when 4 then 'Log Records'
when 5 then 'Prepared'
when 10 then 'Committed'
when 11 then 'Rolled Back'
when 12 then 'Committing'
else 'Unknown - ' + convert(varchar(20), database_transaction_state)
end as 'DB State',
database_transaction_log_record_count as [Log Records],
database_transaction_log_bytes_used / 1024 as [Log KB Used],
database_transaction_log_bytes_reserved / 1024 as [Log KB Reserved],
database_transaction_log_bytes_used_system / 1024 as [Log KB Used (System)],
database_transaction_log_bytes_reserved_system / 1024 as [Log KB Reserved (System)],
database_transaction_replicate_record_count as [Replication Records],
command as [Command Type],
total_elapsed_time as [Elapsed Time],
cpu_time as [CPU Time],
wait_type as [Wait Type],
wait_time as [Wait Time],
wait_resource as [Wait Resource],
reads as [Reads],
logical_reads as [Logical Reads],
writes as [Writes],
SessionTrans.open_transaction_count as [Open Transactions(SessionTrans)],
ExecReqs.open_transaction_count as [Open Transactions(ExecReqs)],
open_resultset_count as [Open Result Sets],
row_count as [Rows Returned],
nest_level as [Nest Level],
granted_query_memory as [Query Memory],
SUBSTRING(SQLText.text,ExecReqs.statement_start_offset/2,(CASE WHEN ExecReqs.statement_end_offset = -1 then LEN(CONVERT(nvarchar(max), SQLText.text)) * 2 ELSE ExecReqs.statement_end_offset end - ExecReqs.statement_start_offset)/2) AS query_text
from
sys.dm_tran_active_transactions ActiveTrans (nolock)
inner join sys.dm_tran_database_transactions DBTrans (nolock)
on DBTrans.transaction_id = ActiveTrans.transaction_id
inner join sys.databases DB (nolock)
on DB.database_id = DBTrans.database_id
left join sys.dm_tran_session_transactions SessionTrans (nolock)
on SessionTrans.transaction_id = ActiveTrans.transaction_id
left join sys.dm_exec_requests ExecReqs (nolock)
on ExecReqs.session_id = SessionTrans.session_id
and ExecReqs.transaction_id = SessionTrans.transaction_id
outer apply sys.dm_exec_sql_text(ExecReqs.sql_handle) AS SQLText
where SessionTrans.session_id is not null -- comment this out to see SQL Server internal processes
If your transaction log has grown so much in such a short time, it means a lot of statements that make data or structure changes have been executed. If your database works with large blob records, you can try looking there first.
Profiler won't help you much in finding out what happened previously, but it can help if this is still going on.
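If it is still going on, a lighter-weight check than Profiler is the log space DMV; a minimal sketch (sys.dm_db_log_space_usage requires SQL Server 2012 or later):
-- Current database's log size and how much of it is in use
SELECT total_log_size_in_bytes / 1048576.0 AS total_log_mb,
       used_log_space_in_bytes / 1048576.0 AS used_log_mb,
       used_log_space_in_percent
FROM sys.dm_db_log_space_usage;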
If you want to read the transaction log itself, you will need a third-party transaction log reader. The best solution on the market is ApexSQL Log, which has saved me a couple of times in similar situations.
However, if your database is running on SQL Server 2000, you can try SQL Log Rescue from Red Gate, because it's free. A third option is to try to find Lumigent Log Explorer (the product is discontinued, but maybe you can find it somewhere online).
Try them all and see which one works best for you.
You can use SQL Server Profiler, which shows every transaction executed along with its start time, end time, and more; I think you will be able to see what is causing your problem.
I hope this helps.
