Just started getting a bunch of errors on our C# .Net app that seemed to be happening for no reason. Things like System.IndexOutOfRangeException on a SqlDataReader object for an index that should be returned and has been returning for a while now.
Anyway, I looked at Task Manager and saw that sqlservr.exe was running at around 1,500,000 K Mem Usage. I am by no means a DBA, but that much memory looked wrong to me on a Win Server 2003 R2 Enterprise box with an Intel Xeon 3.33 GHz and 4 GB RAM. So I restarted the SQL Server instance. After the restart, everything went back to normal and the errors suddenly stopped occurring. So does this large main memory usage eventually cause errors?
Additionally, I did a quick Google for high memory usage mssql. I found that, if left at its default settings, SQL Server can grow to be that large. I also found a link to MS about How to adjust memory usage by using configuration options in SQL Server.
Question now is... how much main memory should SQL Server be limited to?
I'd certainly be very surprised if it's the database itself. SQL Server is an extremely solid product - far better than anything in Office or Windows itself - and can generally be relied on absolutely and completely.
1.5 GB is nothing for an RDBMS - and all of them will just keep filling up their available buffers with cached data. Reads in core are typically 1000x or more faster than disk access, so using every scrap of memory available to it is optimal design. In fact, if you look at any RDBMS design theory you'll see that the algorithms used to decide what to throw away from core are given considerable prominence, as this has a major impact on performance.
Most dedicated DB servers will be running with 4 GB memory (assuming 32-bit) with 90% dedicated to SQL Server, so you are certainly not looking at any sort of edge condition here.
Your most likely problem by far is a coding error or structural issue (such as locking).
I do have one caveat though. Very (very, very - like twice in 10 years) occasionally I have seen SQL Server return page tear errors due to corruption in its database files, both times caused by an underlying intermittent hardware failure. As luck would have it on both occasions these were in pages holding the indexes and by dropping the index, repairing the database, backing up and restoring to a new disk I was able to recover without falling back to backups. I am uncertain as to how a page tear error would feed through to the C# API, but conceivably if you have a disk error which only manifests itself after core is full (i.e. it's somewhere on some swap space) then an index out of bounds error does seem like the sort of manifestation I would expect as a call could be returning junk - hence falling outside an array range.
There are a lot of different factors that can come into play as to what limit to set. Typically you want to limit it in a manner that will prevent it from using up too much of the ram on the system.
If the box is a dedicated SQL box, it isn't uncommon to set it to use 90% or so of the RAM on the box....
However, if it is a shared box that has other purposes, there might be other considerations.
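As a rough illustration of the "90% or so" rule of thumb above, here is a minimal Python sketch that works out a cap in MB and prints the matching `sp_configure` statements. The function names and the 0.90 default are assumptions made for this example only; on a shared box you would pick a smaller share, and the generated script should be reviewed before it is run against a real instance.

```python
def max_server_memory_mb(total_ram_mb: int, sql_share: float = 0.90) -> int:
    """Return a memory cap for SQL Server in MB, leaving the rest to the OS.

    sql_share is the fraction of physical RAM given to SQL Server
    (0.90 is the dedicated-box rule of thumb; an assumption, not a rule).
    """
    if not 0 < sql_share <= 1:
        raise ValueError("sql_share must be in (0, 1]")
    return int(total_ram_mb * sql_share)


def sp_configure_script(cap_mb: int) -> str:
    """Build the T-SQL that sets 'max server memory (MB)' to cap_mb."""
    return "\n".join([
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'max server memory (MB)', {cap_mb};",
        "RECONFIGURE;",
    ])


# Example: the 4 GB server from the question, treated as a dedicated box.
cap = max_server_memory_mb(4096)
print(sp_configure_script(cap))
```

The script is only printed here, not executed, so the number can be sanity-checked first.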
how much main memory should MSSQL be limited to?
As much as you can give it, while ensuring that other system services can function properly. Yes, it's a vague answer, but on a dedicated DB box, MSSQL will be quite happy with 90% of the RAM or such. By design it will take as much RAM as it can.
1.5GB of 4.0GB is hardly taxing... One of our servers typically runs at 1.6GB of 2.5GB with no problems. I think I'd be more concerned if it wasn't using that much.
I don't mean to sound harsh but I wouldn't be so quick to blame the SQL Server for application errors. From my experience, every time I've tried to pass the buck on to SQL Server, it's bit me in the ass. It's usually sys admins or rogue queries that have brought our server to its knees.
There were several times where the solution to a slow running query was to restart the server instead of inspecting the queries, which were almost always at fault. I know I personally rewrote about a dozen queries where the cost was well above 100.
This really sounds like a case of "'select' is broken" so I'm curious if you could find any improvements in your code.
SQL needs the RAM that it is taking. If it was using 1.5 gigs, it's using that for data cache, procedure cache, etc. It's generally better left alone - if you set a cap too low, you'll end up hurting performance. If it's using 1.5 gigs on a 4 gig web box, I wouldn't call that abnormal at all.
Your errors could very likely have been caused by locking - I'd have a hard time saying that the SQL memory usage you described in the question was causing the errors you were getting.
Related
We have an Azure SQL database. Up until a few weeks ago, we were set at 10 DTUs (S0). Recently, we've gotten more SQL timeout errors, prompting us to increase our DTUs to 50 (S2). We get the errors less frequently, but still on occasion. When we get these timeouts, we see spikes on the Resource graph hitting 100%. Drilling into that, it's generally Data I/O operations that are making it spike. But when we check Query Performance Insight, none of the listed queries show that they're using that many resources.
Another thing to note is that our database has grown steadily in size. It is now about 19 GB, and the majority (18 GB) of that comes from one table that has a lot of long JSON strings in it. The timeout errors generally do happen on a certain query that has several joins, but they do not interact with the table with the long strings.
We tested making a copy of the database and removing all the long strings, and it did not get any timeouts at 10 DTU, but performed the same as the database with all the long strings at 50 DTU as far as load times.
We have rebuilt our indexes and, though it helped, we continue to experience timeout errors.
Given that the query that gets timeouts is not touching the table with long strings, could the table with long strings still be the culprit for DTU usage? Would it have to do with SQL caching? Could the long strings be hogging the cache and causing a lot of data I/O? (They are accessed fairly frequently too.)
The strings can definitely exhaust your cache budget if they are hot (as you say they are). When the hot working set exceeds RAM cache size performance can fall off a cliff (10-100x). That's because IO is 10-1000x slower than RAM access. This means that even a tiny decrease in cache hit ratio (such as 1%) can multiply into a big performance loss.
This cliff can be very steep. One moment the app is fine, the next moment IO is off the charts.
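To make the arithmetic behind that cliff concrete, here is a small Python sketch. It assumes a simple two-level model in which every access costs either one RAM access or one IO, with IO a fixed multiple slower; the 1000x figure is taken from the upper end of the range quoted above, and the function name is made up for the example.

```python
def avg_access_cost(hit_ratio: float, io_penalty: float = 1000.0) -> float:
    """Average cost per access, in units of one RAM access.

    hit_ratio  -- fraction of accesses served from cache (0..1)
    io_penalty -- how many times slower an IO is than a RAM access (assumed)
    """
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * 1.0 + (1.0 - hit_ratio) * io_penalty


before = avg_access_cost(0.99)   # 99% of reads served from cache
after = avg_access_cost(0.98)    # hit ratio drops by one point
print(f"99% hits: {before:.2f}x RAM cost, 98% hits: {after:.2f}x RAM cost")
print(f"slowdown from a 1-point drop: {after / before:.2f}x")
```

Under these assumptions a one-point drop in hit ratio nearly doubles the average access cost, because the misses dominate the total even when they are rare.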
Since Azure SQL Database has strict resource limits (as I hear and read), this can quickly exhaust the performance that you bought and bring on throttling.
I think the test you made kind of confirms that the strings are causing the problem. Can you try to segregate the strings somewhere else? If they are cold move them to another table. If they are hot move them to another datastore (database or NoSQL). That way you can likely move back to a much lower tier.
I'm running soak tests at the moment and keep coming up against a weird issue that I've never seen in the past. I've spent quite a while investigating the issue and so far have not got to the bottom of it.
At some point during the test (sometimes 1 hour in, other times 4+ hours) the SQL Server machine starts maxing its CPU. This always corresponds with a sharp decrease in DB cache memory and an increase in free memory.
The signs obviously point at memory pressure and it seems that I can sometimes trigger this event by running a particularly heavy query.
I can understand why the plan cache is being flushed however the aspects of this that are confusing me are:
After the plan cache is flushed and my meaty query finishes there is plenty of free memory (even after further increasing the amount of memory SQL Server is allowed), but the plan cache doesn't seem to recover. I'm left with loads of free memory which isn't helping anyone.
If I stop my soak test and then re-run it immediately, things go back to normal and the plan cache grows as expected. SQL Server does not need to be restarted or to have any settings altered.
After the cache flush the cache hit ratio is still OK-ish, ~90% however this is much lower than the ~99% I am seeing before the flush and really hurting the CPU.
Before the flush a trace of cache misses, inserts and hits looks normal enough. Pre-flush the only issue I see is a non-parameterised ad-hoc query that's being inserted into the cache very frequently however even with this it's a very simple query which has a low cost so would expect these to be flushed from the cache ahead of most other things.
Post flush I'm seeing a very high number of inserts followed immediately by numerous misses on the same object (i.e. stored procedures), and thus memory consumption for the cache remains low.
You can see from the yellow line in the shot of my counters below that the cache memory usage drops off and stays low yet the free memory (royal blue) stays fairly high.
EDIT
After looking into this issue for another good while, a pattern that keeps appearing is that if I push the server to its limit for a short time (adding load above what the soak test is producing) then SQL Server seems to get itself into a mess which it can't recover from on its own.
The number of connections to the server sharply increases when it hits the point of maximum pressure (I'm assuming due to it not being able to deal with requests quickly enough so new connections are needed to deal with the "constant" flow of requests). This backlog is then placing further pressure on the server which it doesn't appear to be able to recover from.
Now, I'm still puzzled by the metrics. I could accept this as purely a server resource issue if the new connections seemed to be eating up memory, further slowing processing, causing new connections, etc. What I am seeing though is that there is plenty of free memory but SQL Server isn't using it for the plan cache. Because of this it's spending more time compiling, upping CPU and things spiral out of control.
It feels like the connections are the key part of this problem. As mentioned before if I restart the test everything goes back to normal. I've since found that putting the DB into single user mode for a few seconds so that all test related connections die, waiting a few seconds and then going back to multi-user mode resolves the issue. I've tried just killing all active connections based on SPID however it seems there needs to be a pause of a few seconds in order for the server to recover and start using the plan cache properly.
See screenshot below of my counters. I'm trying to push the server over the top up to ~02:33:15 and I set to single user mode at ~02:34:30 and then multi-user mode a few seconds after.
Purple line is user connections, thick red is compilations p/s, bright green is cache memory, aqua connection memory, greyish/brown is free memory.
OK, it's been a long circular road but the best answer I currently have for this is that this issue is due to resource constraints and the unfortunate choices that SQL Server makes in relation to the plan cache for my particular circumstances. I'm not saying SQL Server is wrong, just that for my needs at this time I don't think it's making the right decisions.
I've adjusted my soak test so that if the DB server comes under pressure it pulls on the reins a bit and drops some connections, until such time as the server comes back under control and the additional connections can be reestablished. The process of SQL Server getting itself back in order can take a few minutes but it does happen!
It seems that the server was getting itself into a vicious cycle, where it was coming under pressure, dropping cached plans and then having to spend more on recompiling these plans later than it gained by dropping them in the first place. This led to things spiraling out of control and everything grinding to a halt.
In my particular case there is a very high cache hit ratio (above 99.5%) and due to the soak test basically doing the same thing repeatedly for hours for loads of users the cache is very well used. If the cache weren't so well used then SQL Server would have quite possibly made the right choice by dropping plans but I don't think it did here.
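The back-off described above could be sketched roughly as follows. This is not the actual soak-test code: the use of compilations per second as the pressure signal, the function name and every threshold below are assumptions chosen for illustration. The idea is simply to shed connections quickly when the server is struggling and to add them back slowly once it has recovered.

```python
def next_connection_limit(current: int, compiles_per_sec: float, *,
                          high: float = 500.0, low: float = 100.0,
                          floor: int = 10, ceiling: int = 200,
                          step: int = 10) -> int:
    """Decide how many test connections to allow in the next interval.

    compiles_per_sec is a hypothetical pressure signal (e.g. sampled from
    a performance counter). Above `high` the limit is halved; below `low`
    it grows by `step`; in between it is left alone so that the limit does
    not oscillate around a single threshold.
    """
    if compiles_per_sec > high:
        return max(floor, current // 2)      # under pressure: back off fast
    if compiles_per_sec < low:
        return min(ceiling, current + step)  # recovered: ramp up slowly
    return current                           # in the dead band: hold steady


# Example: pressure spikes, then the server slowly comes back under control.
limit = 200
for sample in (900.0, 700.0, 300.0, 80.0, 60.0):
    limit = next_connection_limit(limit, sample)
    print(f"compiles/s={sample:>6.1f} -> allow {limit} connections")
```

The gap between the two thresholds matters: with a single threshold the test would keep adding and dropping connections right at the point where the server tips over.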
Sorry for the long introduction but before I can ask my question, I think giving the background would help understanding our problem much better.
We are using SQL Server 2008 as the backend for our web services, and from time to time it takes far too long to respond to requests that are supposed to run really fast, such as taking more than 20 seconds for a select that queries a table with only 22 rows. We went through many potential causes, from indexes to stored procedures, triggers, etc., and tried to optimize whatever we could, such as removing indexes that are written frequently but never read, or adding NOLOCK to our select queries to reduce locking on the tables (we are OK with dirty reads).
We also had our DBAs review the server and benchmark the components to look for bottlenecks in the CPU, memory or disk subsystem, and found that hardware-wise we are OK as well. And since the spikes occur only occasionally, it is really hard to reproduce the error in production or development: most of the time, when we rerun the same query, it yields the short response times we expect, not the long ones experienced earlier.
Having said that, I have long been suspicious of I/O, although it does not seem to be a bottleneck. But I think I was just able to reproduce the error by running an index fragmentation report for a specific table on the server, which immediately caused spikes not only in requests run against that table but also in requests that query other tables. And since the DB, and the server, are shared with other applications we use, and long-running queries against the server and database are a common scenario for us, my suspicion of an occasional I/O bottleneck is, I believe, becoming a fact.
Therefore I want to find a way to prioritize requests coming from the web services so that they are processed even while other resource-intensive queries are running. I have been looking for this kind of prioritization since the very beginning of the resolution process and found that SQL Server 2008 has a feature called 'Resource Governor' that allows prioritization of requests.
However, since I am neither an expert on Resource Governor nor a DBA, I would like to hear from people who have used or are using Resource Governor, and to know whether I can prioritize I/O for a specific login or a specific stored procedure. (For example, if an I/O-intensive process is running when we receive a web service request, can SQL Server stop, or slow down, I/O activity for that process and give priority to the request we just received?)
Thanks in advance to anyone who spends time reading or helping out.
Some Hardware Details:
CPU: 2x Quad Core AMD Opteron 8354
Memory: 64GB
Disk Subsystem: Compaq EVA8100 series (I am not sure, but it should be RAID 0+1 across 8 HP HSV210 SCSI drives)
PS: I am almost 100 percent sure that the application servers are not causing the error, and there is no bottleneck we can identify there.
Update 1:
I'll try to answer the following questions that gbn asked below as well as I can. Please let me know if you are looking for something else.
1) What kind of index and statistics maintenance do you have please?
We have a weekly running job that defrags indexes every Friday. In addition to that, Auto Create Statistics and Auto Update Statistics are enabled. And the spikes are occurring in other times than the fragmentation job as well.
2) What kind of write data volumes do you have?
Hard to answer. In addition to our web services, there is a front-end application that accesses the same database, and to my knowledge resource-intensive queries need to be run periodically. However, I don't know how to get, let's say, the weekly or daily write volume to the DB.
3) Have you profiled Recompilation and statistics update events?
Sorry, I wasn't able to figure this one out. I didn't understand what you are asking here. Can you provide more information, if possible?
My first thought is that statistics are being updated because the data change threshold has been reached, causing execution plans to be rebuilt.
What kind of index and statistics maintenance do you have please? Note: index maintenance updates index stats, not column stats: you may need separate stats updates.
What kind of write data volumes do you have?
Have you profiled Recompilation and statistics update events?
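For a sense of the data change threshold mentioned above: as far as I know, the documented auto-update rule for SQL Server versions of that era is that statistics on a table become stale after 500 modifications if the table had 500 rows or fewer, and after 500 plus 20% of the row count for larger tables. The Python sketch below just encodes that rule (the function name is made up for the example, and newer versions can use a lower, dynamic threshold), so treat it as a rough guide rather than a guarantee.

```python
def stats_update_threshold(table_rows: int) -> int:
    """Approximate number of row modifications that marks statistics stale.

    Encodes the classic rule: 500 changes for tables of up to 500 rows,
    otherwise 500 + 20% of the table's row count. Assumes the older,
    fixed threshold rather than the dynamic one in newer versions.
    """
    if table_rows < 0:
        raise ValueError("table_rows cannot be negative")
    if table_rows <= 500:
        return 500
    return 500 + table_rows // 5  # 20% of the rows, using integer arithmetic


for rows in (22, 10_000, 1_000_000):
    print(f"{rows:>9} rows -> stats refresh after ~{stats_update_threshold(rows)} changes")
```

If the write volume is high enough to cross these thresholds during the day, the resulting statistics updates and recompiles are worth capturing in a trace.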
In response to question 3) of your Update to the original question, take a look at the following reference on SQL Server Pedia. It provides an explanation of what query recompiles are and also goes on to explain how you can monitor for these events. What I believe gbn is asking (feel free to correct me sir :-) ) is whether you are seeing recompile events prior to the slow execution of the troublesome query. You can look for this occurring by using the SQL Server Profiler.
Reasons for Recompiling a Query Execution Plan
Suddenly our SQL server is using 100% CPU but only using a fraction of the memory it can use (16 GB available).
We're using Web Edition and have allocated the maximum amount of RAM.
Like I say, this has just suddenly happened without us changing anything.
Need some ideas desperately as it's crippling us.
Please do not be tricked by the memory usage shown in task manager - it cannot see what SQL server is really using. You want to be looking at:
SELECT * FROM sys.dm_os_sys_memory DOSM
in particular the system_memory_state_desc column will tell you if you have memory pressure.
High CPU usage could be one of a few other problems:
Has an index been dropped (without your knowledge)?
Do you have indexes at all?
Have you recently seen higher usage of the system (more users/more data)?
Has the system recently been restarted (thus emptying cache and causing re-compiles for queries)?
Has a query/sproc/function been changed (again without your knowledge)?
I would check these things before going further.
I'd read over this article and make sure you have done everything required. I know you probably think you have but double check just to be sure...
Further to my previous question about the Optimal RAID setup for SQL server, could anyone suggest a quick and dirty way of benchmarking the database performance on the new and old servers to compare them? Obviously, the proper way would be to monitor our actual usage and set up all sorts of performance counters and capture the queries, etc., but we are just not at that level of sophistication yet and this isn't something we'll be able to do in a hurry. So in the meanwhile, I'm after something that would be a bit less accurate, but quick to do and still better than nothing. Just as long as it's not misleading, which would be worse than nothing. It should be SQL Server specific, not just a "synthetic" benchmark. It would be even better if we could use our actual database for this.
Measure the performance of your application itself with the new and old servers. It's not necessarily easy:
Set up a performance test environment with your application on (depending on your architecture this may consist of several machines, some of which may be able to be VMs, but some of which may not be)
Create "driver" program(s) which give the application simulated work to do
Run batches of work under the same conditions - remember to reboot the database server between runs to nullify effects of caching (Otherwise your 2nd and subsequent runs will probably be amazingly fast)
Ensure that the performance test environment has enough hardware machines to be able to load the database heavily - this may mean swapping out some VMs for real hardware.
Remember to use production-grade hardware in your performance test environment - even if it is expensive.
Our database performance test cluster contains six hardware machines, several of which are production-grade, one of which contains an expensive storage array. We also have about a dozen VMs on a 7th simulating other parts of the service.
You can always insert, read, and delete a couple of million rows - it's not a realistic mix of operations but it should strain the disks nicely...
Find at least a couple of the queries that are taking some time, or at least that you suspect are taking time, insert a lot of data if you don't have it already, and run the queries having set:
SET STATISTICS IO ON
SET STATISTICS TIME ON
SET STATISTICS PROFILE ON
Those should give you a rough idea of the resources being consumed.
You can also run SQL Server Profiler to get a general idea of what queries are taking a long time and how long they are taking plus other statistics. It outputs a lot of data so try to filter it down a little bit, possibly by long duration or one of the other performance statistics.
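For a quick old-server versus new-server comparison, the same queries can also be timed from a small driver script. The Python sketch below is a minimal harness under stated assumptions: `run_query` is a placeholder for whatever function executes one of your real queries through your own database driver, and the helper name is invented for this example. It reports the first run separately because that run is the one most likely to hit a cold cache.

```python
import statistics
import time
from typing import Callable, Dict


def time_runs(run_query: Callable[[], object], runs: int = 5) -> Dict[str, float]:
    """Call run_query `runs` times and summarise wall-clock durations in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # placeholder: execute the query and fetch all rows
        durations.append(time.perf_counter() - start)
    return {
        "first": durations[0],                    # likely a cold-cache run
        "median": statistics.median(durations),   # typical warm-cache run
        "worst": max(durations),
    }


# Stand-in workload so the sketch runs on its own; swap in a real query call.
result = time_runs(lambda: sum(range(100_000)), runs=5)
print({name: round(seconds, 6) for name, seconds in result.items()})
```

Run it against both servers with the same data and the same queries, and compare medians rather than single runs.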