I created an AWS CloudWatch alarm for Aurora PostgreSQL's FreeableMemory metric and want to verify that it is set up correctly. So I'm looking for queries I can run on Aurora PostgreSQL 12.8 that will push memory usage up to, say, 70 or 80% and trigger the CloudWatch alarm.
The DB is completely empty, and we can create databases/tables etc. and run any queries we want.
I cannot say about Aurora, but on PostgreSQL you could do something like:
SET work_mem = '1TB';  -- let the session keep huge intermediate results in RAM instead of spilling to disk
-- The set-returning function is materialized into a tuplestore, which now stays
-- in memory and keeps growing until the query is cancelled or memory runs out.
SELECT * FROM generate_series(1, 100000000000000000000000000000000);
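If that alone doesn't move the FreeableMemory metric far enough, a hash-based DISTINCT over a large generated set is another way to inflate memory on PostgreSQL 12, where hash aggregates do not spill to disk. This is only a sketch under that assumption; the row count is a placeholder you would scale to the instance size:

SET work_mem = '1TB';
-- Builds a large in-memory hash table; raise or lower the upper bound as needed.
SELECT count(*) FROM (SELECT DISTINCT g FROM generate_series(1, 200000000) AS g) t;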
We have a read-only PostgreSQL RDS database which is heavily queried. We don't perform any inserts/updates/deletes during normal traffic, but we can still see Free Storage Space running out and an increase in the Write IOPS metric. During this period, CPU usage is at 100%.
At some point, the storage space seems to be released.
Is this expected?
The issue was in the end related to our logs. log_statement was set to all, so every single query to PG was being logged. In order to troubleshoot long-running queries, we combined log_statement and log_min_duration_statement.
Since this is a read-only database we want to know about any insert/update/delete operation, so log_statement: mod (mod logs DDL plus data-modifying statements); and we want to know which queries are taking longer than 1s: log_min_duration_statement: 1000.
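For reference, a minimal sketch of those settings as they would appear in postgresql.conf or an RDS parameter group (parameter names and values are standard PostgreSQL; the mod level is an assumption matching the intent above, since ddl alone would not capture insert/update/delete):

log_statement = 'mod'               # log DDL plus data-modifying statements (insert/update/delete)
log_min_duration_statement = 1000   # log any statement running longer than 1000 ms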
Objective
I have an Apache Nifi Docker container on an Azure VM with an attached premium, very high-throughput SSD disk. I have an MSSQL Server 2012 database on AWS. Nifi-to-database communication happens through the mssql jar v6.2, over a high-throughput AWS Direct Connect MPLS network.
Within the Nifi flow only one processor is executed - ExecuteSQLRecord. It uses only one thread/CPU and has 4 GB of JVM heap space available. ExecuteSQLRecord executes a query that returns 1 million rows, which equals a 60MB flow file. The query is driven by table indexes, so there is nothing to optimize on the DB side. The query looks like: SELECT * FROM table WHERE id BETWEEN x AND y.
The issue
ExecuteSQLRecord with 1 thread/CPU and 1 query retrieves 1M rows (60MB) in 40 seconds.
At the same time, the same query run from SSMS on the database's internal network takes 18 seconds.
The query is already optimized on the DB side (with indexes), and throughput scales linearly with an increasing number of threads/CPUs - the network is not a bottleneck.
Questions
Is this performance okay for Nifi with 1 CPU? Is it okay that Nifi spends 22 seconds (out of 40) on retrieving the results and storing them in the Content Repository?
How does Nifi pull the data from MSSQL Server? Is this a pull approach? If yes, maybe we have too many round trips?
How can I check how much time Nifi spends on converting the result set to CSV, and how much on writing into the Content Repository?
Are you using the latest Docker image (1.11.4)? If so, you should be able to set the fetch size on the ExecuteSQLRecord processor (https://issues.apache.org/jira/browse/NIFI-6865).
I got a couple of different results when I searched for the default fetch size for the MSSQL driver; one site said 1 and another said 32. In your case, for that many records, I'd imagine you'd want it to be way higher (see https://learn.microsoft.com/en-us/previous-versions/sql/legacy/aa342344(v=sql.90)?redirectedfrom=MSDN#use-the-appropriate-fetch-size for setting the appropriate fetch size).
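A minimal sketch of the processor configuration this suggests, assuming NiFi 1.11.4+ where ExecuteSQLRecord exposes these properties; the value 10000 is an illustrative starting point, not a tuned number:

ExecuteSQLRecord
    Fetch Size: 10000           # rows requested from the driver per round trip
    Max Rows Per Flow File: 0   # keep the whole result set in one flow file
    Output Batch Size: 0        # transfer everything at once when the query finishes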
To add to Matt's answer, you can examine the provenance data for each flowfile and see the lineage duration (amount of time) it spent in each segment of the flow. You can also see the status history for every processor, so you can examine the data in/out by size and number of flowfiles, CPU usage, etc. for each processor.
The DBs corresponding to the different environments (e.g. production, staging) of our web app are on the same Azure SQL database server. While I read everywhere that an Azure SQL server is just a logical container (and the DBs on it may not even be on the same physical machine) we see signs of DBs behaving as noisy neighbors, i.e. operations made on one of them affecting the performance of others.
We've seen the operations and metrics below correlate between DBs. Let's call one DB "prod" and the other "stage"; in all cases stage was created by copying prod with the Start-AzureWebAppSqlDatabaseCopy PowerShell cmdlet.
Scaling up stage correlates with Data IO spike on prod.
Running performance-heavy operations on stage (deleting thousands of tables, updating around 10 thousand rows) correlates with SQL connection timeouts ("The timeout period elapsed prior to completion of the operation or the server is not responding.") and Data IO spikes on prod.
With both DBs we use separate DB-level user accounts (on why, see this SO post), but the prod and stage user accounts both exist under both DBs (i.e. we use the stage user to connect to the stage DB, but the stage user also exists under the prod DB, and the prod user also exists under the stage DB). We dropped the stage user from the prod DB to see if that makes a difference, but it didn't.
It may be worth noting that when the Web/Business Azure SQL tiers were phased out, these DBs were migrated from Web to their current S1 tier. We see the same issue with DBs on another server too. The DBs are NOT part of an elastic pool.
Our findings are inconclusive, and these events don't correlate 100% of the time either. We're out of ideas about what to investigate, as we are sure that the stage app doesn't connect to the prod DB. We tried to find evidence of the stage app somehow affecting the prod DB, but we couldn't. Any input would be appreciated.
Update 1
Using Grant's sys.dm_os_wait_stats tip, as well as sys.dm_os_performance_counters, it is evident that yes, if you make a copy of a database on the same logical server, it is created on the same physical SQL Server too. The server name in object_name is the same, and the wait values are exactly the same.
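For reference, the check boils down to something like the sketch below (sys.dm_os_performance_counters is available on Azure SQL Database; run it against both DBs and compare):

-- Matching server names embedded in object_name, and identical cumulative
-- counter values, point at the same physical server.
SELECT object_name, counter_name, instance_name, cntr_value
FROM sys.dm_os_performance_counters;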
This doesn't explain, however, why operations on the copy affect the original DB. Since the noisy neighbor effect doesn't seem to happen all the time (the scale-up affects the original DB most of the time, the perf-heavy operations less so, but the correlation is still pronounced), it might be some random Azure problem.
We'll see if using a different logical server fixes the issue. What's sure is that in that case the physical server will also be different; we've checked that.
Update 2
We're monitoring the situation, but whether this indeed solves the issue will most probably only become apparent after several months. For now we have put all DBs on separate servers.
We did notice timeouts on the prod DB, always in the same time interval after all operations on the stage DB completed. These timeouts, however, seem to happen only for table creations. It's as if after copying the prod DB to the stage DB, the prod DB is somewhat "locked" for a period of time (about 45-60 minutes) and you can't create tables (dropping them still works). Funnily enough this didn't happen today, so maybe it has resolved itself...
From the information you provide, I suspect the workload of your databases is occasionally I/O-intensive, hits the tier limits, and Azure SQL starts throttling. That throttling may be behind those timeouts.
Please monitor resource consumption using the query below:
SELECT
(COUNT(end_time) - SUM(CASE WHEN avg_cpu_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'CPU Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_log_write_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Log Write Fit Percent'
,(COUNT(end_time) - SUM(CASE WHEN avg_data_io_percent > 80 THEN 1 ELSE 0 END) * 1.0) / COUNT(end_time) AS 'Physical Data Read Fit Percent'
FROM sys.dm_db_resource_stats
If any fit percent falls below your service level objective (SLO) of 99.9%, go to the next tier.
Measure DTU consumption over time. Are you getting timeouts when the following query shows high DTU usage?
SELECT start_time, end_time,
    (SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent),
        (avg_log_write_percent)) AS value(v)) AS [avg_DTU_percent]
FROM sys.resource_stats
WHERE database_name = 'AdventureWorksLT'
ORDER BY end_time DESC
Compare DTU usage vs DTU limit.
SELECT
end_time AS [EndTime]
, (SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v)) AS [AvgDTU_Percent]
, ((dtu_limit)*((SELECT Max(v) FROM (VALUES (avg_cpu_percent), (avg_data_io_percent), (avg_log_write_percent)) AS value(v))/100.00)) AS [AvgDTUsUsed]
, dtu_limit AS [DTULimit]
FROM sys.dm_db_resource_stats
The way I would go about determining whether this is the case would be to use sys.dm_os_wait_stats along with sys.dm_db_wait_stats. The OS wait stats are for the "server" that your database is running on, and the db wait stats are for the database. Gather the db waits for both databases in question and the os waits for both databases. First, compare the os waits outright. If they're the same (within some margin - I wouldn't expect them to be exactly the same, although if they are, there's your answer), you may be seeing everything on the same server. If they're not quite the same, but sort of similar, then compare the db wait stats for each database to the OS wait stats and see if you can find a direct correlation.
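A minimal sketch of that comparison, using the standard columns of both DMVs (run it once per database and diff the two result sets):

-- Server-level waits: shared by every database on the physical server.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- Database-level waits: scoped to the current database only.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms DESC;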
For management purposes only, I would probably separate these on to different servers anyway, even if this wasn't normally an issue. However, if you can find a correlation, then the best bet is probably to break apart the servers. It won't cost you anything. You pay for the database, not the server.
On our prod database, which runs on Oracle, I want to see the number of queries that get fired.
The reasoning behind this is that we want to see the number of network calls we make, and the impact a firewall could have if we move to a cloud system.
select sum(EXECUTIONS)
from v$sql
where last_active_time >= trunc(sysdate)-2
and (parsing_schema_name like '%\_RW%' escape '\' or parsing_schema_name = 'TEMP_USER')
and module not in ('DBMS_SCHEDULER')
and sql_text not like '%v$sql%';
The above query doesn't seem very reliable, because SQL statements get pushed out of memory, and what is still in memory is all that the above query returns.
Is there any way to get the number of calls we make on our Oracle DB from the database itself? Logging from all the applications is not a feasible option at the moment.
Thanks!
"we want to see the number of network calls we make and the impact firewall could make if we move it to a cloud system"
The number of SQL statements executed is only tangentially related to the amount of network traffic. Compare the impact of select * from dual with select * from humongous_table.
A better approach might be to talk with your network admin and see what they can tell you about the traffic your applications generate. Alternatively, download Wireshark and see for yourself (provided your security team is cool with that).
Just to add some information on V$SQL views:
V$SQLAREA has the lowest retention, and shows the current SQL in memory, parsed, and ready for execution.
V$SQL has better retention, and is updated every 5 seconds after query execution.
V$SQLSTATS has the best retention, and retains SQL even after the cursor has been aged out of the shared pool.
You don't want to run these queries too often on busy production databases, as they can add to shared pool fragmentation.
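Given those retention differences, here is a sketch of the original count adapted to V$SQLSTATS. Note that V$SQLSTATS does not expose parsing_schema_name or module, so the original filters are dropped and the result is closer to an upper bound; filtering by schema would need a join back to V$SQL:

-- Executions recorded over the last 2 days; v$sqlstats retains entries
-- even after cursors have aged out of the shared pool.
select sum(executions)
from v$sqlstats
where last_active_time >= trunc(sysdate) - 2;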
I have a non-trivial query.
When I had 10 DTUs for my database, the query took about 17 seconds to execute.
I increased the level to 50 DTUs - now the execution takes 3-4 seconds.
This ratio corresponds to the documentation - more DTUs = faster work.
But!
1. On my PC I can execute the query in 1 second.
2. In the portal statistics I see that I use only 12 DTUs (max DTU percentage = 25%).
In sys.dm_db_resource_stats I see that MAX(avg_cpu_percent) is about 25%, and the other params are lower.
So the question is - why does my query take 3-4 seconds to execute?
It can be executed in 1 second, and the server does not use all my DTUs.
How can I make the server use all available resources to execute queries faster?
DTU is a combined measurement of CPU, memory, data I/O and transaction log I/O.
This means that reaching a DTU bottleneck can mean any of those.
This question may help you to measure the different aspects: Azure SQL Database "DTU percentage" metric
And here's more info on DTU: https://learn.microsoft.com/en-us/azure/sql-database/sql-database-what-is-a-dtu
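To see which of those components is the actual bottleneck during the run, a quick sketch against sys.dm_db_resource_stats (standard columns; the view keeps 15-second samples for roughly the last hour):

SELECT end_time, avg_cpu_percent, avg_data_io_percent,
       avg_log_write_percent, avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC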
On my PC I can execute the query in 1 sec
We should not be comparing our on-prem computing power with DTUs.
A DTU is a combination of CPU, IO and memory that you get based on your performance tier, so the comparison is not valid.
How to make server use all available resources to exec queries faster?
This is simply not possible, since when SQL runs a query, memory is the only constraint that can prevent the query from even starting. The rest of the resources, like CPU and IO speed, can increase or decrease based on what the query does.
In summary, you will have to ensure queries are not constrained by a resource crunch - they can use up all resources if they need to, and release them when not needed.
You also will have to look at wait types and further fine-tune the query.
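As a starting point for that wait-type check, a sketch against sys.dm_exec_requests (available on Azure SQL Database; narrow it down to your own query's session as needed):

-- What each active request is currently waiting on.
SELECT session_id, status, wait_type, wait_time, last_wait_type,
       cpu_time, total_elapsed_time
FROM sys.dm_exec_requests
WHERE session_id <> @@SPID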
As Bernard Vander Beken mentioned:
DTU is a combined measurement of CPU, memory, data I/O and transaction log I/O.
I'll also add that Microsoft does not share the formula used to calculate DTUs. You mentioned that you are not seeing DTUs peg at 100% during query execution. But since we do not know the formula, you may very well be pegging components of DTU, but not pegging DTU itself.
Azure SQL is a shared environment, and each tenant will be throttled to ensure that the minimum SLA is met for all tenants.
What a DTU is is quite fuzzy.
We have done an experiment where we run a set of benchmarks on machines with the same amount of DTU on different data centers.
http://dbwatch.com/azure-database-performance-measured
It turns out that the actual performance varies by a factor of 5.
We have also seen instances where the performance of a repeated query on the same database varies drastically.
We provide our database performance benchmarks for free if you would like to compare the instance you run on your PC with the instance in the Azure cloud.