I have an application that I'd like to make more efficient - it isn't taxing any one resource enough that I can identify it as a bottleneck, so perhaps the app is doing something that is preventing full efficiency.
The application pulls data from a database on one SQL Server instance, does some manipulation on it, then writes it to a database on another SQL Server instance - all on one machine. It doesn't do anything in parallel.
While the app is running (it can take several hours), none of the 4 CPU cores are maxed out (they hover around 40-60% utilization each), the disks are almost idle and very little RAM is used.
Reported values:
Target SQL Server instance: ~10% CPU utilization, 1.3GB RAM
Source SQL Server instance: ~10% CPU utilization, 300MB RAM
Application: ~6% CPU utilization, 45MB RAM
All the work is happening on one disk, which writes around 100 KB/s during the operation, on average. 'Active time' according to Task Manager is usually 0%, occasionally flickering up to between 1 and 5% for a second or so. Average response time, again according to Task Manager, moves between 0ms and 20ms, mainly showing between 0.5 and 2ms.
Databases are notorious for IO limitations. Now, seriously, as you say:
The application pulls data from a database on one SQL Server instance,
does some manipulation on it, then writes it to a database on another
SQL Server instance - all on one machine.
I somehow get the idea this is an end-user-level machine, maybe a workstation. Your linear code (a bad idea if you want full utilization, by the way, as you never run all 3 parts - read, process, write - in parallel) will be seriously limited by whatever IO subsystem you have.
But that will not come into play as long as you can state:
It doesn't do anything in parallel.
What it should do is work in parallel:
One task reads the next batch of data
One task does the data processing
One task does the data writing
You can definitely max out a lot more than your 4 cores. Last time I did something like that (read / manipulate / write) we were maxing out 48 cores with around 96 or so processing threads running in parallel (and a smaller number doing the writes). But the core of that is that your application must actually start using multiple CPUs.
If you do not parallelize:
You will only max out one core at most,
You basically waste time waiting for the databases on both ends. The latency while you wait for data to be read or committed is time during which you are not processing anything.
;) And once you fix that you will get IO problems. Promised.
I recommend reading How to analyse SQL Server performance. You need to capture and analyze the wait stats. These will tell you what the execution is waiting on and what prevents it from maxing out the CPU. You already have a feeling that the workload is causing the SQL engine to wait rather than run, but only after you understand the wait stats will you be able to get a feel for what it is waiting for. Follow the linked article for specific analysis techniques.
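As a starting point, a minimal snapshot of the accumulated waits can be pulled from sys.dm_os_wait_stats, something like the sketch below (the filter excludes only a couple of benign idle waits; the linked article has a much more complete exclusion list):

-- Top waits accumulated since the last restart (or since the stats were cleared)
SELECT TOP (20)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'BROKER_TASK_STOP')
ORDER BY wait_time_ms DESC;

Since the numbers are cumulative, run it twice some time apart while the app is running and diff the two snapshots to see which waits are actually accumulating.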
Related
I have been using Clickhouse at work for analytics purposes for a while now.
I am currently running Clickhouse v22.6.3 revision 54455 on-premise on a VM with:
fast storage
200 GB of RAM
no swap
a 40-core CPU.
I have a few TB of data, but no table bigger than 300 GB. I do not use distributed tables or replication yet, and I write frequently into Clickhouse (but I don't use deletes or updates and prefer using things like the ReplacingMergeTree engine). I also leverage the MaterializedView feature for a few tables. Let me know if you need any more context or parameters; I use a pretty standard configuration.
Now, for a few months I have been experiencing performance issues where the server significantly slows down every day at 10am, and I cannot figure out why.
Based on Clickhouse's built-in Graphite monitoring, the "symptoms" of the issue seem to be as follows:
At 10am:
On the server side:
Both load and RAM usage remain reasonable. Load goes up a little.
Disk write await time goes up (which I suspect is what leads to higher load)
Disk utilization % skyrockets to something between 90 and 100%
On Clickhouse side:
DiskSpaceReservedForMerge stays roughly the same (i.e. between 0 and 70 GB)
both OpenFileForRead and OpenFileForWrite go up by a factor of ~2
BackgroundCommonPoolTask goes slightly up, so does BackgroundSchedulePoolTask (which I found weird, because I thought this pool was dedicated to distributed operations - which I don't use) - both numbers remain seemingly reasonable
The number of active Merge tasks per minute drops significantly, but I'm unsure whether it's a consequence of the slow writes or the cause of them
both insert and general query times are multiplied by ~10, which renders the database effectively unusable even for small tasks
Restarting Clickhouse usually fixes the problem, but I obviously do not want to restart my main database every day at 10am. Most of the heavy load I put on the DB (such as data extraction and transformation, etc.) happens earlier in the morning (and ends around 7-8am) and runs fine. I do not have any heavy tasks running at 10am. The Clickhouse VM takes most of its host's resources, and I have confirmed with the DevOps team that there doesn't seem to be a problem on the host or anything else scheduled on it at that time.
Is there any kind of background task or process that is run by Clickhouse on a daily basis and that could have a high impact on our disk I/O capacity? What else can I monitor to figure out what is causing this problem?
Again, let me know if I can be more thorough on our settings and the state of the DB when the "bug" occurs.
Do you use https://github.com/innogames/graphite-ch-optimizer ?
Do you use TTL ?
select * from system.merges;
select * from system.part_log where event_date = today() and toHour(event_time) = 10; -- parts written and merged around 10am
I have a simple SSIS ETL moving data from one server to another; a couple of the tables are 150 GB+. For these larger tables, how do I optimize the MaximumInsertCommitSize? On the server that is being loaded, I see that the RAM utilization is near 100% (64 GB) for the duration of the load for these tables. I am also getting PAGEIOLATCH_EX suspensions on the server that is being loaded.
This leads me to believe that the MaximumInsertCommitSize needs to be brought down.
My first question is whether you agree with this?
My other questions are more open-ended:
How do I optimize it by trial and error when it takes an hour to load the table (and table size matters for this operation, so I would have to load all of it)?
Would network speed ever play into this optimization, due to the increased bandwidth with multiple ?partitions? of data?
Would hard drive speed affect it, as this is the server's cheapest component (facepalm)? My thinking here is that the PAGEIOLATCH waits indicate the number of disk operations is being minimized, but since the RAM is over-utilized, sending tasks to the drive instead of waiting would be better.
Finally, is there a calculator for this? I feel that even a vague approximation of MaximumInsertCommitSize would be great (just based on network, disk, RAM, and file size, assuming the destination is partitioned and has ample disk space).
I'm running soak tests at the moment and keep coming up against a weird issue that I've never seen in the past. I've spent quite a while investigating the issue and so far have not got to the bottom of it.
At some point during the test (sometimes 1 hour in, other times 4+ hours) the SQL Server machine starts maxing out its CPU. This always corresponds with a sharp decrease in DB cache memory and increase in free memory.
The signs obviously point at memory pressure and it seems that I can sometimes trigger this event by running a particularly heavy query.
I can understand why the plan cache is being flushed however the aspects of this that are confusing me are:
After the plan cache is flushed and my meaty query finishes, there is plenty of free memory (even after further increasing the amount of memory SQL Server is allowed), yet the plan cache doesn't seem to recover. I'm left with loads of free memory which isn't helping anyone.
If I stop my soak test and then re-run it immediately, things go back to normal and the plan cache grows as expected. SQL Server does not need to be restarted or to have any settings altered.
After the cache flush the cache hit ratio is still OK-ish, ~90%; however, this is much lower than the ~99% I was seeing before the flush, and it is really hurting the CPU.
Before the flush, a trace of cache misses, inserts and hits looks normal enough. Pre-flush, the only issue I see is a non-parameterised ad-hoc query that's being inserted into the cache very frequently; however, even then it's a very simple query with a low cost, so I would expect these to be flushed from the cache ahead of most other things.
Post flush I'm seeing a very high number of inserts followed immediately by numerous misses on the same object (i.e. stored procedures), and thus memory consumption for the cache remains low.
You can see from the yellow line in the shot of my counters below that the cache memory usage drops off and stays low yet the free memory (royal blue) stays fairly high.
EDIT
After looking into this issue for another good while, a pattern that keeps appearing is that if I push the server to its limit for a short time (adding load above what the soak test is producing) then SQL Server seems to get itself into a mess which it can't recover from on its own.
The number of connections to the server sharply increases when it hits the point of maximum pressure (I'm assuming due to it not being able to deal with requests quickly enough so new connections are needed to deal with the "constant" flow of requests). This backlog is then placing further pressure on the server which it doesn't appear to be able to recover from.
Now, I'm still puzzled by the metrics. I could accept this as purely a server resource issue if the new connections seemed to be eating up memory, further slowing processing, causing new connections, etc. What I am seeing though is that there is plenty of free memory but SQL Server isn't using it for the plan cache. Because of this it's spending more time compiling, upping CPU and things spiral out of control.
It feels like the connections are the key part of this problem. As mentioned before, if I restart the test everything goes back to normal. I've since found that putting the DB into single-user mode for a few seconds so that all test-related connections die, waiting a few seconds, and then going back to multi-user mode resolves the issue. I've tried just killing all active connections based on SPID; however, it seems there needs to be a pause of a few seconds in order for the server to recover and start using the plan cache properly.
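For reference, the single-user bounce I'm doing is roughly the sketch below (SoakTestDb is a placeholder for the actual database name):

ALTER DATABASE SoakTestDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;  -- kills all other connections immediately
WAITFOR DELAY '00:00:05';                                           -- give the server a few seconds to settle
ALTER DATABASE SoakTestDb SET MULTI_USER;                           -- let the test connections back in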
See screenshot below of my counters. I'm trying to push the server over the top up to ~02:33:15 and I set to single user mode at ~02:34:30 and then multi-user mode a few seconds after.
Purple line is user connections, thick red is compilations p/s, bright green is cache memory, aqua connection memory, greyish/brown is free memory.
OK, it's been a long circular road but the best answer I currently have for this is that this issue is due to resource constraints and the unfortunate choices that SQL Server makes in relation to the plan cache for my particular circumstances. I'm not saying SQL Server is wrong, just that for my needs at this time I don't think it's making the right decisions.
I've adjusted my soak test so that if the DB server comes under pressure it pulls on the reins a bit and drops some connections, until such time that the server comes back under control and the additional connections can be reestablished. The process of SQL Server getting itself back in order can take a few minutes but it does happen!
It seems that the server was getting itself into a vicious cycle, where it was coming under pressure, dropping cached plans and then having to spend more on recompiling these plans later than it gained by dropping them in the first place. This led to things spiraling out of control and everything grinding to a halt.
In my particular case there is a very high cache hit ratio (above 99.5%) and due to the soak test basically doing the same thing repeatedly for hours for loads of users the cache is very well used. If the cache weren't so well used then SQL Server would have quite possibly made the right choice by dropping plans but I don't think it did here.
We have an SQL Server instance with about 40 different databases (about 1-5 GB each). The server is an 8-core, 2.3 GHz machine with 32 GB of RAM, 27 GB of which is pinned to SQL Server. CPU utilization is almost always close to 100% and memory consumption is about 95%. The problem here is the CPU, which is constantly close to 100%, and I am trying to understand the reason.
I have run an initial check to see which database contributes to the high CPU by using this script, but I could not pin down in detail what is really consuming the CPU. The top query (from all DBs) only takes about 4 seconds to complete. IO is also not a bottleneck.
Would memory be the culprit here? I have checked the memory split, and the OBJECT CACHE occupies about 80% of the memory allocated (27 GB) to SQL Server. I hope that is normal, given there are a lot of SPs involved. Running Profiler, I do see a lot of recompiles, but most are due to "temp table changed", "deferred compile", etc., and I am not clear whether these recompiles are a result of plans getting thrown out of the cache due to memory pressure.
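(For reference, a minimal way of checking that split is to sum sys.dm_os_memory_clerks by clerk type - pages_kb assumes SQL Server 2012 or later; older builds expose single_pages_kb and multi_pages_kb instead:)

SELECT TOP (10)
       type,
       SUM(pages_kb) / 1024 AS memory_mb   -- memory held by each clerk type, in MB
FROM sys.dm_os_memory_clerks
GROUP BY type
ORDER BY SUM(pages_kb) DESC;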
Appreciate any thoughts.
You can see some reports in SSMS:
Right-click the instance name / reports / standard / top sessions
You can see top CPU consuming sessions. This may shed some light on what SQL processes are using resources. There are a few other CPU related reports if you look around. I was going to point to some more DMVs but if you've looked into that already I'll skip it.
You can use sp_BlitzCache to find the top CPU-consuming queries. You can also sort by IO and other things as well. This uses DMV info, which accumulates between restarts.
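For example, something along these lines sorts the cache by CPU (parameter names as per the First Responder Kit documentation; check them against the version you have installed):

EXEC sp_BlitzCache @SortOrder = 'cpu', @Top = 10;  -- other sort orders include 'reads', 'duration', 'executions'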
This article looks promising.
Some stackoverflow goodness from Mr. Ozar.
edit:
A little more advice...
A query running for 'only' 5 seconds can be a problem. It could be using all your cores and really running 8 cores times 5 seconds - 40 seconds of 'virtual' time. I like to use some DMVs to see how many executions have happened for that code to see what that 5 seconds adds up to.
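A rough sketch of that, using sys.dm_exec_query_stats (note total_worker_time is in microseconds, and the numbers only cover plans still in the cache):

SELECT TOP (20)
       qs.execution_count,
       qs.total_worker_time / 1000 AS total_cpu_ms,                       -- total CPU across all executions
       qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_ms,    -- average CPU per execution
       SUBSTRING(st.text, 1, 200) AS query_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;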
According to this article on sqlserverstudymaterial:
Remember that "% Privileged Time" is not based on 100%. It is based on the number of processors. If you see 200 for sqlserver.exe and the system has 8 CPUs, then the CPU consumed by sqlserver.exe is 200 out of 800 (only 25%).
If "% Privileged Time" value is more than 30% then it's generally caused by faulty drivers or anti-virus software. In such situations make sure the BIOS and filter drives are up to date and then try disabling the anti-virus software temporarily to see the change.
If "% User Time" is high then there is something consuming of SQL Server.
There are several known patterns that can cause high CPU for processes running in SQL Server, including:
I'm working on a real-time video analysis system which processes the video stream frame by frame. At each frame it can generate several events which should be recorded and some delivered to another system via network. The system is soft real-time, i.e. message latencies higher than 25ms are highly undesirable, but not fatal.
Are relational databases (specifically, MySQL and Postgres) appropriate as the datastore for such system?
Can I expect the DB to work well when it is installed on its own server and has ~50 streams of 25fps single-row SQL inserts (roughly 1,250 inserts per second) coming in over the network?
EDIT: I think in general performance would not be a problem, but I worry about the latency variance. If it will occasionally delay for 1000 ms, that would be very bad.
Oh, and the system runs 24/7 so the DB could grow arbitrarily big. Does that degrade the insert latency?
I wouldn't worry too much about performance when choosing a relational database over another type of datastore; choose the solution that best meets your requirements for accessing that data later. However, if you do choose not only an RDBMS but one over the network, then you might want to consider buffering events to a local disk briefly on their way over to the DB. Use a separate thread or process to push events into the DB so the real-time system stays unaffected.
The biggest problems are how unpredictable the latency will be and how it only ever goes up, never down. But modern hardware comes to the rescue: specify a machine with enough CPU cores. You can count on at least two; getting four is easy. So you can spin up a thread and dedicate one core to the database updates, isolating them from your soft real-time code. Now you don't care about the variability in the delays, at least as long as the database updates don't take so long that you generate data faster than that thread can consume it.
Set up a database server and load it up with fake data, double the amount you think it will ever need to store. Test continuously while you develop, and add the instrumentation code you need to measure how it is doing at an early stage in the project.
As I've written, if you queue the rows that need to be saved and save them in an async way (so as not to stop the "main" thread) there shouldn't be any problem... BUT!!!
You want to save them in a DB... So someone else will read the rows AT THE SAME TIME they are being written. Sadly, it's normally quite difficult to tell a DB "this work is very high priority, everything else can be stalled but not this". So if someone does:
BEGIN TRANSACTION
SELECT COUNT(*) FROM SomeTable WITH (TABLOCK, HOLDLOCK)
WAITFOR DELAY '01:00:00'
(I'm using T-SQL here... But I think it's quite clear: ask for the COUNT(*) of the table with a lock hint such as TABLOCK + HOLDLOCK, so that the shared lock on the table is held for the whole transaction, and then WAITFOR an hour)
then the writes could be stalled and time out. In general, if you configure everyone but the app to be able to do only reads, these problems shouldn't be present.
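If it helps, a minimal sketch of that kind of lock-down (ReportingUser is a hypothetical database principal; ALTER ROLE ... ADD MEMBER needs SQL Server 2012 or later):

ALTER ROLE db_datareader ADD MEMBER ReportingUser;            -- read access to every table in the DB
DENY INSERT, UPDATE, DELETE ON SCHEMA::dbo TO ReportingUser;  -- and explicitly no writes on the dbo schema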