The DBs corresponding to the different environments (e.g. production, staging) of our web app are on the same Azure SQL database server. While I read everywhere that an Azure SQL server is just a logical container (and the DBs on it may not even be on the same physical machine) we see signs of DBs behaving as noisy neighbors, i.e. operations made on one of them affecting the performance of others.
We've seen the operations and metrics below correlate between DBs. Let's call one DB "prod" and the other "stage"; in all cases stage was created by copying prod with the Start-AzureWebAppSqlDatabaseCopy PowerShell cmdlet.
Scaling up stage correlates with Data IO spike on prod.
Running performance-heavy operations on stage (deleting thousands of tables, updating around 10 thousand rows) correlates with SQL connection timeouts ("The timeout period elapsed prior to completion of the operation or the server is not responding.") and Data IO spikes on prod.
With both DBs we use separate DB-level user accounts (on why, see this SO post), but the prod and stage user accounts both exist under both DBs (i.e. we use the stage user to connect to the stage DB, but the stage user also exists under the prod DB, and the prod user also exists under the stage DB). We dropped the stage user from the prod DB to see if that makes a difference, but it didn't.
It may be worth noting that when the Web/Business Azure SQL tiers were phased out, these DBs were migrated from Web to their current S1 tier. We see the same issue with DBs on another server too. The DBs are NOT part of an elastic pool.
Our findings are inconclusive, and these events don't correlate 100% of the time either. We're out of ideas as to what to investigate, as we are sure that the stage app doesn't connect to the prod DB. We tried to find evidence of the stage app somehow affecting the prod DB, but we couldn't. Any input would be appreciated.
Update 1
Using Grant's sys.dm_os_wait_stats tip, as well as sys.dm_os_performance_counters, it is evident that yes, if you make a copy of a database on the same logical server, it will be created on the same physical SQL Server too: the server name in object_name is the same, and the wait values are exactly the same.
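For reference, a minimal sketch of the kind of comparison involved (illustrative queries, not necessarily the exact ones we ran); run them against each database separately and compare the outputs:
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters
ORDER BY object_name, counter_name, instance_name;
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;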
This doesn't explain, however, why operations on the copy affect the original DB. Since it seems that the noisy-neighbor effect doesn't happen all the time (the scale-up affects the original DB most of the time, the perf-heavy operations less so, but the correlation is still pronounced), it might be some random Azure problem.
We'll see if using a different logical server fixes the issue. What's sure is that in that case the physical server will also be different; we've checked that.
Update 2
We're monitoring the situation but whether this indeed solves the issue will most possibly be apparent only after several months. For now we have put all DBs on separate servers.
We did notice that timeouts on the prod DB always occur in the same time interval after all operations on the stage DB complete. These timeouts, however, seem to only happen for table creations. It's as if after copying the prod DB to the stage DB, the prod DB is somewhat "locked" for a period of time (about 45-60 minutes) during which you can't create tables (you can drop them though, those work). Funnily enough this didn't happen today, so maybe it has resolved itself...
From the information you provide, I suspect that the workload of your databases is occasionally I/O intensive, hits the tier limits, and Azure SQL starts throttling. That throttling may be behind those timeouts.
Please monitor resource consumption using the query below:
SELECT
(COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats
If a fit percentage falls below your service level objective (SLO), e.g. 99.9%, go to the next tier.
Measure DTU consumption over time. Are you getting timeouts when the following query shows high DTU usage?
SELECT start_time, end_time,
(SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent),
(avg_log_write_percent)) AS value(v)) as [avg_DTU_percent]
FROM sys.resource_stats where database_name = 'AdventureWorksLT' order by end_time desc
Compare DTU usage vs DTU limit.
SELECT
end_time AS [EndTime]
, (SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v)) AS [AvgDTU_Percent]
, ((dtu_limit)*((SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v))/100.00)) AS [AvgDTUsUsed]
, dtu_limit AS [DTULimit]
FROM sys.dm_db_resource_stats
The way I would go about determining whether this is the case would be to use sys.dm_os_wait_stats along with sys.dm_db_wait_stats. The OS wait stats are for the "server" that your database is running on, and the db wait stats are for the database itself. Gather the db waits for both databases in question and the os waits for both databases. First, straight up compare the os waits. If they're the same (within some margin; I wouldn't expect them to be exactly the same, although if they are, there's your answer), you may be seeing everything on the same server. If they're not really the same, but sort of similar, then compare the db wait stats for each database to the OS wait stats and see if you can find a direct correlation.
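A minimal sketch of gathering both sets of waits side by side (the join and ordering are just for readability; run it in each database and compare):
SELECT os.wait_type,
       os.wait_time_ms AS os_wait_time_ms,
       db.wait_time_ms AS db_wait_time_ms
FROM sys.dm_os_wait_stats AS os
LEFT JOIN sys.dm_db_wait_stats AS db
    ON db.wait_type = os.wait_type
ORDER BY os.wait_time_ms DESC;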
For management purposes alone, I would probably separate these onto different servers anyway, even if this weren't normally an issue. However, if you can find a correlation, then the best bet is probably to break apart the servers. It won't cost you anything: you pay for the database, not the server.
Related
So I have this issue: our client uses MS SQL databases. Two months ago they migrated their databases to SQL Server 2019 Enterprise from an earlier version and Standard edition.
Their major reason was to secure high availability through the Availability Groups feature of MS SQL.
After that, our application got really slow. Put simply, the customer starts the app, selects a workspace, and then it takes about 15 seconds to load the data.
The first step just sends a request to the database to select data - no inserts, deletes or other performance-heavy processes.
The app works with geographical and geometry data; every geo object is saved in the database as the geometry data type. The first huge, major select is what causes the slowness.
When I looked at Activity Monitor, the only thing under wait categories that seemed suspicious to me was the Other wait type.
In the database I don't see any high-cost queries, and the availability group mode is set to synchronous.
If I'm getting this right, the synchronous mode should not be the cause of this problem, because this database is clearly used for reading data, not, as I mentioned, for modifying it.
I made changes to some instance parameters: I set Optimize for Ad hoc Workloads to True and raised the cost threshold for parallelism from 5 to 20.
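Roughly, the changes were along these lines (a sketch via sp_configure, not the exact steps I took):
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'optimize for ad hoc workloads', 1;
EXEC sp_configure 'cost threshold for parallelism', 20;
RECONFIGURE;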
Another thing I tried was to create a new app source database and a database containing the geo data inside that SQL instance, without adding them to availability groups.
For testing purposes, the application connects to that one instance with the new test databases.
Neither of these changes worked. So if you have any idea or any experience with this, please help me.
Here is a screenshot of the top 10 waits from the sys DMV.
1 - Stats recompute...
When you move from one SQL Server version to a higher one, you must first change the compatibility level (to get the performance benefits) and then recompute all statistics in the database with a FULLSCAN. Why? Because each version of SQL Server comes with a new optimizer that has new operators, new algorithms and many improvements. To keep up with this new version of the optimizer, the method of computing statistics and the form of their results is rethought with each modification of the engine; so much so that using old statistics with a new engine is like taking the 1930 population census to plan the construction of roads, schools and hospitals for today's actual population.
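A minimal sketch of those two steps, assuming a database named YourDatabase on SQL Server 2019 (the undocumented but widely used sp_MSforeachtable is just one way to touch every table):
ALTER DATABASE [YourDatabase] SET COMPATIBILITY_LEVEL = 150;  -- 150 = SQL Server 2019
GO
USE [YourDatabase];
GO
-- Recompute every table's statistics with a full scan (can take a while on large databases)
EXEC sp_MSforeachtable 'UPDATE STATISTICS ? WITH FULLSCAN';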
2 - SQL Server Editions...
When upscaling SQL Server from Standard to Enterprise, you need to increase the "hardware" (even if it is a VM), because many of the features that run under the Enterprise edition and do not exist in Standard need more computational resources. As an example, using AUTO_UPDATE_STATISTICS_ASYNC automatically uses one more thread, to the detriment of other processes. By comparison, a Rolls Royce or a Hummer instead of a Volkswagen is arguably more comfortable and faster, but requires more fuel and more expensive insurance!
3 - Synchronous AG...
Synchronous AlwaysOn availability groups need a very fast and faultless network. If this is not the case, the replication of update requests can drag performance down, especially if you are using pessimistic locking (the default mode).
4 - Transaction logs...
A common global performance problem is latency when writing to the transaction log.
5 - Tempdb files...
Another common global performance problem is latency when accessing the tempdb files.
For those two file problems, use Glenn Berry's file latency query, which will give you an indication. Good values are under 7 ms for reads and 15 ms for writes.
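A minimal sketch of a per-file latency query in the same spirit, based on sys.dm_io_virtual_file_stats (the column choices here are illustrative, not Glenn Berry's exact script):
SELECT
    DB_NAME(vfs.database_id) AS database_name,
    mf.physical_name,
    CASE WHEN vfs.num_of_reads  = 0 THEN 0
         ELSE vfs.io_stall_read_ms  / vfs.num_of_reads  END AS avg_read_latency_ms,
    CASE WHEN vfs.num_of_writes = 0 THEN 0
         ELSE vfs.io_stall_write_ms / vfs.num_of_writes END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
    ON vfs.database_id = mf.database_id AND vfs.file_id = mf.file_id
ORDER BY avg_write_latency_ms DESC;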
CONCLUSION
Many other factors can contribute to slowing down your system, but without more information we cannot help further...
I've setup two SQL DBs on Azure with geo-replication. The primary is in Brazil and a secondary in West Europe.
Similarly, I have two web apps running the same web api: a Brazilian web app that reads from and writes to the Brazilian DB, and a European web app that reads from the European DB and writes to the Brazilian DB.
When I test response times on read-only queries with Postman from Europe, I first notice that on a first "cold" call the European web app is twice as fast as the Brazilian one. However, on immediately following calls, response times on the Brazilian web app are 10% of the initial "cold" call, whereas response times on the European web app remain the same. I also notice that after a few minutes of inactivity, results are back to the "cold" case.
So:
1. Why do query response times drop in Brazil?
2. Whatever the answer to 1 is, why doesn't it happen in Europe?
3. Why doesn't the response-time improvement seen in 1 last after a few minutes of inactivity?
Note that both web apps and DB are created as copy/paste (except geo-replication) from each other in an Azure ARM json file.
Both web apps are alwaysOn.
Thank you.
UPDATE
Actually there are several parts in action in what I see as an end user: the web apps and the DBs. I wrote this question thinking the issue was around the DBs and geo-replication; however, after trying @Alberto's script (see below) I couldn't see any differences in wait times when querying Brazil or Europe, so the problem may be in the web apps. I don't know how to further analyse/test that.
UPDATE 2
This may be (or not) related to query store. I asked on a new more specific question on that subject.
UPDATE 3
Queries on the secondary database are not slower. My question was based on false conclusions. I won't delete it, as others took time to answer it, and I thank them.
I was comparing query response times through rest calls to a web api running EF queries on a SQL Server DB. As rest calls to the web api located in the region querying the db replica are slower than rest calls to the same web api deployed in another region targeting the primary db, I concluded the problem was on the db side. However, when I run the queries in SSMS directly, bypassing the web api, I observe almost no differences in response times between primary and replica db.
I still have a problem but it's not the one raised in that question.
On Azure SQL Database, your database's memory allocation may be dynamically reduced after some minutes of inactivity; in this behavior Azure SQL differs from SQL Server on-premises. If you run a query two or three times, it then starts to execute faster again.
If you examine the query execution plan and its wait stats, you may find a wait named MEMORY_ALLOCATION_EXT for queries executed after the memory allocation has been shrunk by the Azure SQL Database service. Databases with a lot of activity and query execution may not see their memory allocation reduced. For more detail please read this StackOverflow thread.
Take into consideration also that both databases should have the same service tier assigned.
Use the script below to determine query waits and see what the difference in waits is between both regions.
DROP TABLE IF EXISTS #before;
SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [max_wait_time_ms],
[signal_wait_time_ms]
INTO #before
FROM sys.[dm_db_wait_stats];
-- Execute test query here
SELECT *
FROM [dbo].[YourTestQuery]
-- Finish test query
DROP TABLE IF EXISTS #after;
SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [max_wait_time_ms],
[signal_wait_time_ms]
INTO #after
FROM sys.[dm_db_wait_stats];
-- Show accumulated wait time
SELECT [a].[wait_type], ([a].[wait_time_ms] - [b].[wait_time_ms]) AS [wait_time]
FROM [#after] AS [a]
INNER JOIN [#before] AS [b] ON
[a].[wait_type] = [b].[wait_type]
ORDER BY ([a].[wait_time_ms] - [b].[wait_time_ms]) DESC;
Over the past week or two, we've seen four cases where our Azure SQL Database DTU graph ends up looking like this:
That is, it seems to "restart" (note that the graph consistently shows 0 DTUs before the spike, which was definitely not the case because we have constant traffic on this server). This seems to indicate that the DTU measurements restarted. The large spike, followed by the subsequent decaying and stabilizing DTU values seems to indicate to us that the database is "warming up" (presumably doing things like populating caches and organizing indexes perhaps?). The traffic to the web app that accesses this database showed nothing abnormal over the same time period, so we don't have any reason to think that this is a result of "high load".
The "Activity Log" item in Azure doesn't show any information. Looking at the "Resource Health" of our database, however, we saw the following:
Note the "A problem with your SQL database has been resolved" message. The timestamp, however, doesn't exactly correspond to the time of the spike above (the graph shows UTC+1 time, and presumably the resource-health timestamp is in UTC, so there's about a 1.15-hour difference).
Clicking on "View History" gives us all such events for the past couple of weeks:
In each case the database is "available" again within the refresh-granularity (2 minutes), again suggesting restarts. Interestingly, the restarts are around 4 days apart in each case.
Of course I expect and understand that the database may be moved around and restarted from time to time. Our web app is ASP.NET Core 2.0 and uses connection resiliency, so we don't have any failing requests.
That said, considering that this has been happening relatively frequently in the last few weeks, I'm of course wondering if this is something that needs action from our side. We did, for example, upgrade to Entity Framework Core 2.0 around 5 weeks ago, so I'm slightly concerned that that might have something to do with it.
My questions:
Is there any way to know for sure that the database server restarted? Is this information stored in the database itself anywhere, or perhaps on the master database?
Is there any way to know the reason for such restarts, and whether or not it's "our fault" or simply a result of hosting-environment changes? Does the Azure team make such information publicly available anywhere?
The database is on S3 Standard level (100 DTUs) and is hosted in South-East Asia. It's around 3.5GB in size.
Please enable Query Store to identify the queries and statements involved in those spikes you see on the DTU consumption graph.
ALTER DATABASE [DB1] SET QUERY_STORE = ON;
Then use a query like the one below to identify long-running queries and the tables involved in them. The names of the tables may give you an idea of what is creating those spikes.
SELECT TOP 10 rs.avg_duration, qt.query_sql_text, q.query_id,
qt.query_text_id, p.plan_id, GETUTCDATE() AS CurrentUTCTime,
rs.last_execution_time
FROM sys.query_store_query_text AS qt
JOIN sys.query_store_query AS q
ON qt.query_text_id = q.query_text_id
JOIN sys.query_store_plan AS p
ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats AS rs
ON p.plan_id = rs.plan_id
WHERE rs.last_execution_time > DATEADD(hour, -1, GETUTCDATE())
ORDER BY rs.avg_duration DESC;
About the downtimes logged in Resource Health: it seems they are related to maintenance tasks because they occur every 4 days, but I will report it to the SQL Azure team and try to get feedback.
On our prod database, which is based on Oracle, I want to see the number of queries that get fired.
The reasoning behind this is that we want to see the number of network calls we make and the impact a firewall could have if we move to a cloud system.
select sum(EXECUTIONS)
from v$sql
where last_active_time >= trunc(sysdate)-2
and (parsing_schema_name like '%\_RW%' escape '\' or parsing_schema_name = 'TEMP_USER')
and module not in ('DBMS_SCHEDULER')
and sql_text not like '%v$sql%';
The query above doesn't seem very reliable, because v$sql (which is what it reads from) only returns SQL that is still in memory, and statements get pushed out over time.
Is there any way to get the number of calls we make on our Oracle DB from the database itself? Logging from all the applications is not a feasible option at the moment.
Thanks!
"we want to see the number of network calls we make and the impact firewall could make if we move it to a cloud system"
The number of SQL statements executed is only tangentially related to the amount of network traffic. Compare the impact of select * from dual with select * from humongous_table.
A better approach might be to talk with your network admin and see what they can tell you about the traffic your applications generate. Alternatively, download Wireshark and see for yourself (provided your security team is cool with that).
Just to add some information on V$SQL views:
V$SQLAREA has the lowest retention, and shows the current SQL in memory, parsed, and ready for execution.
V$SQL has better retention, and is updated every 5 seconds after query execution.
V$SQLSTATS has the best retention, and retains SQL even after the cursor has been aged out of the shared pool.
You don't want to run these queries too often on busy production databases, as they can add to shared pool fragmentation.
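Given that, a minimal sketch of counting executions from V$SQLSTATS instead (note: V$SQLSTATS does not expose PARSING_SCHEMA_NAME or MODULE, so the original filters cannot be applied and this total covers all sessions):
SELECT SUM(executions) AS total_executions
FROM   v$sqlstats
WHERE  last_active_time >= TRUNC(SYSDATE) - 2
AND    sql_text NOT LIKE '%v$sql%';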
We have a system that concurrently inserts a large amount of data from multiple stations while also exposing a data-querying interface. The schema looks something like this:
[SyncTable]
    SyncID
    StationID
    MeasuringTime

[DataTypeTable]
    TypeID
    TypeName

[DataTable]
    SyncID
    TypeID
    DataColumns...
Data insertion is done in a "synchronization" and goes like this (we only insert data into the system, we never update):
INSERT INTO SyncTable(StationID, MeasuringTime) VALUES (X,Y); SELECT @@IDENTITY
INSERT INTO DataTable(SyncID, TypeID, DataColumns) VALUES
(SyncIDJustInserted, InMemoryCachedTypeID, Data)
... lots (500) similar inserts into DataTable ...
And queries go like this (for a given station, measuring time and data type):
SELECT SyncID FROM SyncTable WHERE StationID = @StationID
  AND MeasuringTime = @MeasuringTime
SELECT DataColumns FROM DataTable WHERE SyncID = @SyncIDJustSelected
  AND DataTypeID = @TypeID
My question is how can we combine the transaction level on the inserts and NOLOCK/READPAST hints on the queries so that:
We maximize the concurrency in our system while favoring the inserts (we need to store a lot of data, something as high as 2000+ records a second)
Queries only return data from "committed" synchronizations (we don't want a result set with a half-inserted synchronization, or a synchronization with some entries skipped due to lock-skipping)
We don't care whether the "newest" data is included in the query; we care more about consistency and responsiveness than about "live", up-to-date data
These may be very conflicting goals and may require a high transaction isolation level, but I am interested in all tricks and optimizations to achieve high responsiveness on both inserts and selects. I'll be happy to elaborate if more details are needed to flesh out more tweaks and tricks.
UPDATE: Just adding a bit more information for future replies. We are running SQL Server 2005 (probably 2008 within six months) on a SAN with 5+ TB of storage initially. I'm not sure what kind of RAID the SAN is set up with or precisely how many disks we have available.
If you are running SQL 2005 and above, look into implementing snapshot isolation. You will not be able to get consistent results with NOLOCK.
Solving this on SQL 2000 is much harder.
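A minimal sketch of turning on row versioning, assuming a database named YourDb (the name is illustrative). Readers then see the last committed version of each row instead of blocking behind in-flight inserts, so no NOLOCK/READPAST hints are needed:
ALTER DATABASE YourDb SET ALLOW_SNAPSHOT_ISOLATION ON;
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;  -- requires no other active connections in the database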
This is a great scenario for SQL Server 2005/2008 Enterprise's Partitioning feature. You can create a partition for each StationID, and each StationID's data can go into its own filegroup (if you want, may not be necessary depending on your load.)
This buys you some advantages with concurrency (a partitioning sketch follows the list below):
If you partition by stationid, then users can run select queries for stationid's that aren't currently loading, and they won't run into any concurrency issues at all
If you partition by stationid, then multiple stations can insert data simultaneously without concurrency issues (as long as they're on different filegroups)
If you partition by syncid range, then you can put the older data on slower storage.
If you partition by syncid range, AND if your ranges are small enough (meaning not a range with thousands of syncids) then you can do loads at the same time your users are querying without running into concurrency issues
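A minimal sketch of partitioning by StationID, assuming StationID values 1-4 and pre-created filegroups FG1-FG4 (the names and boundary values are illustrative, not from the original question):
CREATE PARTITION FUNCTION pfStation (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3);           -- 4 partitions: <=1, <=2, <=3, >3

CREATE PARTITION SCHEME psStation
    AS PARTITION pfStation TO (FG1, FG2, FG3, FG4);

-- Partition-aligned table: loads and queries for different StationIDs
-- touch different partitions (and filegroups).
CREATE TABLE dbo.SyncTable
(
    SyncID        int IDENTITY(1,1) NOT NULL,
    StationID     int      NOT NULL,
    MeasuringTime datetime NOT NULL,
    CONSTRAINT PK_SyncTable PRIMARY KEY CLUSTERED (StationID, SyncID)
) ON psStation (StationID);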
The scenario you're describing has a lot in common with data warehouse nightly loads. Microsoft did a technical reference project called Project Real that you might find interesting. They published it as a standard, and you can read through the design docs and the implementation code in order to see how they pulled off really fast loads:
http://www.microsoft.com/technet/prodtechnol/sql/2005/projreal.mspx
Partitioning is even better in SQL Server 2008, especially around concurrency. It's still not a silver bullet - it requires manual design and maintenance by a skilled DBA. It's not a set-it-and-forget-it feature, and it does require Enterprise Edition, which costs more than Standard Edition. I love it, though - I've used it several times and it's solved specific problems for me.
What type of disk system will you be using? If you have a large striped RAID array, writes should perform well. If you can estimate your required reads and writes per second, you can plug those numbers into a formula and see if your disk subsystem will keep up. Maybe you have no control over hardware...
Wouldn't you wrap the inserts in a transaction, which would make them unavailable to the reads until the insert is finished?
This should follow if your hardware is configured correctly and you're paying attention to your SQL coding - which it seems you are.
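A minimal sketch of that wrapping, using the table and column names from the question (the @ parameters and SCOPE_IDENTITY() are illustrative additions, not the original code); readers that don't use NOLOCK/READPAST only see the rows once the COMMIT completes:
BEGIN TRANSACTION;
    INSERT INTO SyncTable (StationID, MeasuringTime) VALUES (@StationID, @MeasuringTime);

    DECLARE @SyncID int;
    SET @SyncID = SCOPE_IDENTITY();

    INSERT INTO DataTable (SyncID, TypeID, DataColumns) VALUES (@SyncID, @TypeID, @Data);
    -- ... the remaining ~500 DataTable inserts for this synchronization ...
COMMIT TRANSACTION;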
Look into the SQLIO.exe and SQLIOStress.exe tools:
SQLIOStress.exe
SQLIOStress.exe simulates various patterns of SQL Server 2000 I/O behavior to ensure rudimentary I/O safety.
The SQLIOStress utility can be downloaded from the Microsoft Web site. See the following article.
• How to Use the SQLIOStress Utility to Stress a Disk Subsystem such as SQL Server
http://support.microsoft.com/default.aspx?scid=kb;en-us;231619
Important: The download contains a complete white paper with extended details about the utility.
SQLIO.exe
SQLIO.exe is a SQL Server 2000 I/O utility used to establish basic benchmark testing results.
The SQLIO utility can be downloaded from the Microsoft Web site. See the following:
• SQLIO Performance Testing Tool (SQL Development) – Customer Available
http://download.microsoft.com/download/f/3/f/f3f92f8b-b24e-4c2e-9e86-d66df1f6f83b/SQLIO.msi