We're running SQL Server 2019 on a Google Cloud Linux VM (2 vCPU, 4 GB RAM), using the mcr.microsoft.com/mssql/server:latest image. Possibly of interest: it's Enterprise edition. I understand that 2 vCPU and 4 GB are not really sufficient for any kind of SQL Server workload, but this is currently just a PoC/test environment, and it shouldn't be relevant since no queries are executed anyway (maybe?).
The instance hosts about 10 databases, all tiny (a few MB each), and they aren't queried at all. The only external queries executed are SELECT 1 health checks (from .NET Core apps).
After a few weeks of a flat gcloud dashboard (CPU ~1% at all times), a few days ago the CPU started to behave oddly. It sticks to 60% most of the time (give or take a few percent). The pattern is visible in the image below: it stays at ~60% for 30 minutes, then drops for a few minutes, and goes back up to 60%. It once went higher, but most of the time it's a steady 60%:
When I SSH into the instance, top looks like this:
Additional information:
MAXDOP is set to 2
Memory is not limited (a mistake on my part; it should probably be capped at ~3 GB on a 4 GB VM?)
Other config: default
I've run the query shown here: techcommunity.microsoft.com, and got results that look like the old 2008 issue: constant resource monitor values of 32-33.
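For reference, the check in question reads the scheduler-monitor ring buffer; a minimal sketch of that kind of query (standard DMVs, not my exact script) looks like this:

declare @ts_now bigint = (select cpu_ticks / (cpu_ticks / ms_ticks) from sys.dm_os_sys_info);
select top (30)
    dateadd(ms, -1 * (@ts_now - [timestamp]), getdate()) as event_time,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') as sql_process_cpu,
    record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') as system_idle
from (
    select [timestamp], convert(xml, record) as record
    from sys.dm_os_ring_buffers
    where ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
      and record like N'%<SystemHealth>%'
) as rb
order by [timestamp] desc;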
How can I find out what's eating the CPU?
I'm just curious in which direction to look... Is it resource-monitor-related? Linux? Docker? Bad configuration? Too weak a VM?
Edit: the reason I haven't increased VM resources or set a RAM limit so far is that it took a few weeks for this issue to occur, and I don't want to make it disappear by restarting the VM. We have 2 more VMs with the same setup, still behaving as expected (~1% CPU).
Edit 2: I added monitoring snapshots of the last 48h for this and 2 other VMs running a similar workload - below.
Column 1: VM1 - topic of the discussion
Column 2: VM2 - behaves as expected
Column 3: VM3 - behaves as expected
The memory footprint of all instances is the same:
VIRT = 12.5g
RES = 2.5-3.0g
%MEM = 65%
Related
We're using an AWS Lightsail PostgreSQL database. We've been experiencing errors with our C# application timing out when connecting to the database. While trying to debug the issue, I looked at the metric graphs in AWS and noticed that many of them have frequent gaps in the data, labeled "No data available". See image below.
This graph (and most of the other metrics) shows frequent gaps in the data. I'm trying to understand whether this is normal or a symptom of the problem. If I go back to a 2-week timescale, there doesn't appear to be any other strange behavior in any of the metric data. For example, I don't see a point in the past where CPU or memory usage went crazy. The issue started about a week ago, so I was hoping the metrics would help explain why the connections to the PostgreSQL database are failing from C#.
So I guess my question is: are those frequent gaps of "No data available" normal for an AWS Lightsail PostgreSQL database?
Other Data about the machine:
1 GB RAM, 1 vCPU, 40 GB SSD
PostgreSQL database (12.11)
In the last two weeks (the average metrics show):
CPU utilization has never gone over 20%
Database connections have never gone over 35 (usually less than 5) (actually, usually 0)
Disk queue depth never goes over 0.2
Free storage space hovers around 36.5 GB
Network receive throughput is mostly less than 1 kB/s (with one spike to 141 kB/s)
Network transmit throughput is mostly less than 11 kB/s, with all spikes less than 11.5 kB/s
I would love to view the AWS logs, but they start a month back and are filled with checkpoint starting/complete entries. Each page update only takes me 2 hours forward in time (and takes ~6 seconds to fetch the logs), so reaching the present would require ~360 page updates - and when I tried, my auth timed out.
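One direction still open to me is querying the database itself instead of the graphs. A minimal sketch (standard pg_stat_activity columns in PostgreSQL 12) for watching live connection state:

select state, count(*) as connections, max(now() - state_change) as time_in_state
from pg_stat_activity
where datname = current_database()
group by state
order by connections desc;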
So we never figured out the reason, but this seems to have been a problem with the AWS Lightsail DB itself. We ended up using a snapshot to create a new clone of the DB and wiring the C# servers to the new DB. The latency issues we were having disappeared and the metric graphs look normal (without the strange gaps).
I wish we had been able to figure out the root of the problem. At the moment, we're just hoping it doesn't return.
When in doubt, clone everything!
I have been using ClickHouse at work for analytics purposes for a while now.
I am currently running ClickHouse v22.6.3 (revision 54455) on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-core CPU.
I have a few TB of data, but no table bigger than 300 GB. I don't use distributed tables or replication yet, and I write frequently into ClickHouse (but I don't use deletes or updates, preferring things like the ReplacingMergeTree engine). I also leverage materialized views for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
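For context, the write path looks roughly like this - a hedged sketch with made-up table and column names, not our real schema:

create table events
(
    event_id   UInt64,
    updated_at DateTime,
    payload    String
)
engine = ReplacingMergeTree(updated_at)
order by event_id;

create materialized view events_per_day
engine = SummingMergeTree
order by day
as select toDate(updated_at) as day, count() as n
from events
group by day;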
Now, for a few months I have been experiencing performance issues where the server significantly slows down every day at 10am, and I cannot figure out why.
Based on ClickHouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes slightly up, as does BackgroundSchedulePoolTask (which I found weird, because I thought that pool was dedicated to distributed operations, which I don't use); both numbers remain seemingly reasonable
The number of active merge tasks per minute drops significantly, but I'm unsure whether that's a consequence of the slow writes or their cause
both insert and general query times are multiplied by ~10, which renders the database effectively unusable even for small tasks
Restarting ClickHouse usually fixes the problem, but I obviously don't want to restart my main database every day at 10am. Most of the heavy load I put on the DB (data extraction and transformation, etc.) happens earlier in the morning (ending around 7-8am) and runs fine. I don't have any heavy tasks running at 10am. The ClickHouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background task or process that ClickHouse runs on a daily basis and that could have a high impact on our disk capacity? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough about our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer ?
Do you use TTL?
select * from system.merges;
select * from system.part_log where event_time between toDateTime('2022-09-01 09:30:00') and toDateTime('2022-09-01 10:30:00'); -- placeholder window around 10am
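For example, something like this (the time window is again a placeholder around 10am) shows which tables are producing and merging parts when the slowdown starts:

select database, table, event_type, count() as events, sum(size_in_bytes) as bytes
from system.part_log
where event_time between toDateTime('2022-09-01 09:30:00') and toDateTime('2022-09-01 11:00:00')
group by database, table, event_type
order by bytes desc;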
I am running a 10-node Vertica testbed, with each node having 256 GB of RAM and 40 CPU cores. When I look at the resource utilisation (memory/CPU/IO) in the Vertica Management Console (MC), I do not see any unusual activity.
CASE A: 1 User Test
I ran a single-user JMeter JDBC test and looked at system resources; there was hardly any resource utilisation (CPU/memory/IO within 20%).
CASE B: 5 User Test
I ran a 5-user JMeter JDBC test and looked at system resources; again there was hardly any resource utilisation (CPU/memory/IO within 20%).
However, the 5-user test took about 3 times longer than the single-user test to execute the queries.
NOTE: since this is SELECT-query benchmarking, there were no exclusive locks.
Despite the above, why is there such a huge time difference?
I am looking for a direction to follow in such a scenario. I have done all the basic fact-checking that the Vertica documentation asks us to do (i.e. looking into the query_requests, dc_allocation_*, resource_acquisitions etc. tables to find the cause).
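For example, on the queueing side I've been trying checks along these lines (a sketch; I believe RESOURCE_ACQUISITIONS records when a query entered a pool queue versus when it acquired resources, but column names may vary by version):

select pool_name,
       count(*) as acquisitions,
       max(acquisition_timestamp - queue_entry_timestamp) as max_queue_wait
from resource_acquisitions
group by pool_name
order by max_queue_wait desc;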
I switched from a join- and view-based strategy on my old dedicated server to a multiple-small-queries strategy for Google Cloud. For me this is also easier to maintain, and on my dev machine there was no noticeable performance difference. But on App Engine with Cloud SQL it is really slow.
For example, querying the last 50 articles takes 4-5 seconds; on my dev machine it takes 160 ms. Each article needs at least 12 queries, about 15 on average. That's ~750 queries, and when I monitor Cloud SQL I notice it always caps at ~200 queries per second. The CPU only peaks at 20%; I have just a db-n1-standard-1 with SSD. 200 queries per second also means that fetching the last 100 articles takes 8-9 seconds, and so on.
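To make the pattern concrete, each of those per-article queries is a tiny single-row lookup, roughly like this (hypothetical table and column names):

select tag from article_tags where article_id = 42; -- one of ~15 round trips per article

So the total time looks dominated by round-trip count rather than per-query cost.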
I already tried setting the App Engine instance class to F4 to see if that would change anything. It didn't; the numbers were the same. I haven't tried increasing the DB instance because I can't see that it's at its limit.
What to do?
Software: I use Go with an unlimited MySQL connection pool.
EDIT: I even changed to the db-n1-standard-2 instance and there was no difference :(
EDIT 2: I tried some changes over the weekend (1500 IOPS, 4 cores, etc.) but nothing showed the expected improvements. The usage graphs were already indicating that there is no "hardware" limit. I managed to isolate the slow query though... it was a super simple one where I query the country name via country ISO2 and language ISO3; both keys are indexed and it still takes 50 ms EACH. So I just cached ALL countries in Memcache and done.
Google Cloud SQL uses GCE VM instances, so things that apply to GCE also apply to Cloud SQL.
When you create a db-n1-standard-1 instance, your network throughput is capped at 250 MB/s by your CPU, but your read/write (a) disk throughput and (b) IOPS are capped by the storage capacity and type, which is:
Write: 4.8 MB/s throughput | 300 IOPS
Read: 4.8 MB/s throughput | 300 IOPS
You can check what, if anything, is at its limit on your instance details page:
https://console.cloud.google.com/sql/instances/[INSTANCE_NAME]
If you want to increase the performance of your instance, raise the number of vCPUs and the storage capacity/type, as suggested in the links above.
We have a SQL Server instance with about 40 different databases (about 1-5 GB each). The server is an 8-core 2.3 GHz CPU with 32 GB of RAM, 27 GB of which is pinned to SQL Server. CPU utilization is close to 100% almost all the time and memory consumption is about 95%. The problem here is the CPU, which is constantly close to 100%, and I'm trying to understand the reason.
I ran an initial check to see which database contributes to high CPU using this script, but I could not pin down in detail what's really consuming CPU. The top query (across all DBs) only takes about 4 seconds to complete. IO is not a bottleneck either.
Would memory be the culprit here? I have checked the memory split, and the OBJECT CACHE occupies about 80% of the memory allocated (27 GB) to SQL Server. I hope that is normal given there are a lot of stored procedures involved. Running Profiler, I do see a lot of recompiles, but mostly due to "temp table changed", "deferred compile", etc., and I'm not clear whether these recompiles are a result of plans getting thrown out of cache due to memory pressure.
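For reference, the memory split can be seen with something like this (a sketch using the standard memory-clerk DMV; the pages_kb column assumes SQL Server 2012 or later):

select top (10) [type], name, sum(pages_kb) / 1024 as size_mb
from sys.dm_os_memory_clerks
group by [type], name
order by size_mb desc;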
Appreciate any thoughts.
You can see some reports in SSMS:
Right-click the instance name / reports / standard / top sessions
You can see top CPU consuming sessions. This may shed some light on what SQL processes are using resources. There are a few other CPU related reports if you look around. I was going to point to some more DMVs but if you've looked into that already I'll skip it.
You can use sp_BlitzCache to find the top CPU-consuming queries. You can also sort by IO and other things as well. This uses DMV info, which accumulates between restarts.
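For example (assuming the First Responder Kit is installed; @SortOrder also accepts 'reads', 'duration', and 'executions'):

exec sp_BlitzCache @SortOrder = 'cpu', @Top = 10;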
This article looks promising.
Some Stack Overflow goodness from Mr. Ozar.
Edit:
A little more advice...
A query running for 'only' 5 seconds can be a problem. It could be using all your cores, really consuming 8 cores × 5 seconds = 40 seconds of 'virtual' time. I like to use some DMVs to see how many executions have happened for that code, to see what those 5 seconds add up to.
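Something along these lines (a sketch; sys.dm_exec_query_stats reports worker time in microseconds, hence the division):

select top (20)
    qs.execution_count,
    qs.total_worker_time / 1000 as total_cpu_ms,
    qs.total_worker_time / qs.execution_count / 1000 as avg_cpu_ms,
    st.text as query_text
from sys.dm_exec_query_stats qs
cross apply sys.dm_exec_sql_text(qs.sql_handle) st
order by qs.total_worker_time desc;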
According to this article on sqlserverstudymaterial:
Remember that "% Privileged Time" is not based on 100%. It is based on the number of processors. If you see 200 for sqlserver.exe and the system has 8 CPUs, then the CPU consumed by sqlserver.exe is 200 out of 800 (only 25%).
If the "% Privileged Time" value is more than 30%, it's generally caused by faulty drivers or anti-virus software. In such situations, make sure the BIOS and filter drivers are up to date, and then try disabling the anti-virus software temporarily to see whether anything changes.
If "% User Time" is high, then something inside SQL Server is consuming the CPU.
There are several known patterns that can cause high CPU for processes running in SQL Server, including: