We have a read-only PostgreSQL RDS database which is heavily queried. We don't perform any inserts/updates/deletes during normal traffic, yet we can see Free Storage Space steadily running out and an increase in the Write IOPS metric. During this period, CPU usage is at 100%.
At some point, the storage space seems to be released.
Is this expected?
The issue was in the end related to our logs. log_statement was set to all, so every single query to PG was being logged. In order to troubleshoot long-running queries, we combined log_statement and log_min_duration_statement.
Since this is a read-only database we only want to know if any insert/update/delete operation happens, so log_statement: ddl; and we want to know which queries are taking longer than 1s: log_min_duration_statement: 1000.
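As a minimal sketch of how those two settings could be applied (assuming a self-managed Postgres where ALTER SYSTEM is permitted; on RDS the equivalent values go into the instance's DB parameter group instead):

    -- Log DDL statements, and any statement slower than 1 second.
    -- Note: log_statement = 'mod' would additionally capture insert/update/delete.
    ALTER SYSTEM SET log_statement = 'ddl';
    ALTER SYSTEM SET log_min_duration_statement = 1000;  -- milliseconds
    SELECT pg_reload_conf();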
Hi, as per the link mentioned below, spilling is described as: "When Snowflake cannot fit an operation in memory, it starts spilling data first to disk, and then to remote storage."
Part#1
- "cannot fit an operation in memory": does that mean the memory size of the warehouse is too small to handle the workload, and that queries are getting into a queued state?
What operations could cause this other than joins?
Part#2
- "it starts spilling data first to disk, and then to remote storage": what does "disk" refer to in this context, given that a warehouse is just a compute unit with no disk of its own?
Does this mean the data that can't fit in warehouse memory will spill into the storage layer?
- What is referred to as "remote storage"? Does that mean an internal stage?
Please help me understand disk spilling in Snowflake.
https://community.snowflake.com/s/article/Recognizing-Disk-Spilling
Memory is the compute server's memory (fastest to access), local storage is the EBS volume attached to the EC2 instance, and remote storage is S3 (slowest to access).
This spilling can have a profound effect on query performance (especially if remote disk is used for spilling). To alleviate this, it's recommended to use a larger warehouse (effectively increasing the available memory/local disk space for the operation), and/or to process the data in smaller batches.
Docs Reference: https://docs.snowflake.com/en/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory
Yes, remote spilling goes to S3 (local spilling uses the local instance disk/cache) - and generally, once things get to remote spilling the situation is quite bad and query performance suffers.
Other than rewriting the query, you can always try running it on a larger warehouse, as mentioned in the docs - it will have more memory and local disk of its own, and spilling should reduce noticeably.
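If it helps, spilling can also be spotted outside the query profile. A rough sketch against the ACCOUNT_USAGE.QUERY_HISTORY view (column names as I recall them, so verify against your account):

    -- Queries from the last 7 days that spilled, worst offenders first
    SELECT query_id,
           warehouse_name,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage,
           total_elapsed_time
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
      AND (bytes_spilled_to_local_storage > 0
           OR bytes_spilled_to_remote_storage > 0)
    ORDER BY bytes_spilled_to_remote_storage DESC;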
Variations of JOIN like FLATTEN that create more rows, and aggregation operations like COUNT DISTINCT.
Just yesterday I was doing some COUNT DISTINCTs over two years' worth of data, with monthly aggregation, and it was spilling both to local and to remote.
I realized I was doing COUNT(DISTINCT column1, column2) when I wanted COUNT(*), as all those pairs of values were already distinct, and that stopped the remote spill. To avoid most of the local spill I split my SQL into batches of one year in size (the data was clustered on time, so the reads were not wasteful) and inserted the result sets into a table. Lastly, I ran the batches on an extra-large warehouse instead of a medium.
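Roughly what that rewrite looked like, with made-up table and column names:

    -- Before: forces a distinct aggregation over (column1, column2), which spilled
    SELECT DATE_TRUNC('month', event_time) AS month,
           COUNT(DISTINCT column1, column2) AS pairs
    FROM my_events
    GROUP BY 1;

    -- After: the pairs were already distinct, so a plain count is enough,
    -- and limiting each run to one year keeps the working set in memory
    INSERT INTO monthly_counts
    SELECT DATE_TRUNC('month', event_time) AS month,
           COUNT(*) AS pairs
    FROM my_events
    WHERE event_time >= '2022-01-01' AND event_time < '2023-01-01'
    GROUP BY 1;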
I do not know the exact answer as to where the local/remote disk is, but many EC2 instance types come with local disk, so it's possible they use those instances; otherwise it would likely be EBS. I believe remote is S3.
But the moral of the story is: just like a PC using swap memory, it's nice that the operation doesn't instantly fail, but most of the time you'd be better off if it did, because how long it ends up taking is painful.
I have one database whose size is growing very fast. Currently its size is around 60 GB; however, after executing the sp_spaceused stored procedure I could verify that more than 40 GB is unused (unused space specifically, not reserved space, which I understand is for table growth). The actual data size is around 10-12 GB, plus a few GB of reserved space.
To reclaim that unused space I tried the shrink operation, but it turned out not to help. After searching further I also found advice not to shrink the DB, as that causes fragmentation and results in delays during disk operations. Now I am really not sure what other operation I should try to reclaim the space and compact the DB. I understand that due to the size, queries might be taking longer than expected, and reclaiming this space could help with performance (not sure).
While investigating I also came across the Generate Scripts feature. It helps export data and schema, but I am not sure whether it also helps create a script (covering every user, permission and other objects) so that the script could create an as-is replica (deep copy/clone) of the DB on another database/server, by creating the schema and then populating it with data?
Any pointers would be helpful.
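For reference, this is roughly how I measured the space above (the database name is just a placeholder):

    -- Overall data/log size and unused space for the current database
    USE MyDatabase;                            -- hypothetical name
    EXEC sp_spaceused @updateusage = N'TRUE';  -- recalculates usage before reporting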
If your database is 60 GB, it means it had to grow to 60 GB at some point. Even if the data is only 20 GB, you probably have operations that grow the data from time to time (e.g. nightly index maintenance jobs). The recommendation is to leave the database at 60 GB. Do not attempt to reclaim the space; you will only cause harm, and whatever caused the database to grow to 60 GB in the first place will likely occur again and trigger database growth.
In fact, you should do the opposite. Try to identify why it grew to 60 GB and extrapolate what will happen when your data reaches 30 GB. Will the database grow to 90 GB? If yes, then you should grow it to 90 GB now. The last thing you want is for growth to occur at random, and to possibly run out of disk space at a critical moment. You should also check right now whether your server has Instant File Initialization enabled.
Now of course, the question is: what would cause 3x data-size growth, and how do you identify it? I don't know of any easy method. I would recommend starting with your SQL Agent jobs. Check your maintenance scripts. Look into the application itself: does it have a pattern of growing and then deleting data? Look at past backups (you do have them, right?) and compare.
BTW, I assume due diligence and that you checked it is the data file that has grown to 60 GB. If it is the LOG file that has grown, then the fix is easy: it means you enabled the full recovery model and forgot to back up the log.
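A quick way to check which file accounts for the 60 GB (standard catalog views; sizes are reported in 8 KB pages):

    SELECT name,
           type_desc,                                         -- ROWS (data) or LOG
           size * 8 / 1024 AS size_mb,
           FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024 AS used_mb
    FROM sys.database_files;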
I have a client with a very large database on SQL Server 2005. The total space allocated to the db is 15 GB, with roughly 5 GB for the data and 10 GB for the transaction log. Just recently a web application that connects to that db has started timing out.
I have traced the actions on the web page and examined the queries that execute whilst these web operations are performed. There is nothing untoward in the execution plans.
The query itself uses multiple joins but completes very quickly. However, the db server's CPU spikes to 100% for a few seconds. The issue occurs when several simultaneous users are working on the system (when I say several, read about 5). Under this load, timeouts start to occur.
I suppose my question is: can a large transaction log cause issues with CPU performance? There is about 12 GB of free space on the disk currently. The configuration is a little out of my hands, but the data file and log are both on the same physical disk.
I appreciate that the log file is massive and needs attending to, but I'm just looking for a heads-up as to whether this may cause CPU spikes (i.e. trying to find a correlation). The timeouts are a recent thing and this app has been responsive for a few years (i.e. it's a recent manifestation).
Many Thanks,
It's hard to say exactly given the lack of data, but spikes like this are commonly observed at a transaction log checkpoint.
A checkpoint is the process of applying changes recorded sequentially in the transaction log to the actual data files.
This involves a lot of I/O as well as CPU work, and may be the reason for the CPU activity spikes.
Normally, a checkpoint occurs when the transaction log becomes 70% full, or when SQL Server estimates that a recovery (replaying the log) would take longer than one minute.
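To see how full the log actually gets (and so how plausible the checkpoint theory is), a quick check:

    -- Log size and percentage used for every database on the instance
    DBCC SQLPERF(LOGSPACE);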
Your first priority should be to address the transaction log size. Is the DB being backed up correctly, and how frequently? Address these issues and then see if the CPU spikes go away.
A CHECKPOINT is the process of reading your transaction log and applying the changes to the DB file; if the transaction log is HUGE, it makes sense that this could have an effect.
You could try increasing the autogrowth increment: Kimberly Tripp suggests upwards of 500 MB autogrowth for transaction logs measured in GBs:
http://www.sqlskills.com/blogs/kimberly/post/8-Steps-to-better-Transaction-Log-throughput.aspx
(see point 7)
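For example, something along these lines sets a fixed 512 MB growth increment on the log file (the database and logical file names here are hypothetical; check yours in sys.database_files):

    ALTER DATABASE MyDb
    MODIFY FILE (NAME = MyDb_log, FILEGROWTH = 512MB);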
While I wouldn't be surprised if a log that size were causing a problem, there are other things it could be as well. Have the statistics been updated lately? Are the spikes happening when some automated job is running? Is there a clear time pattern to when you have the spikes - if so, look at what else is running then. Did you load a new version of anything on the server around the time the spikes started happening?
In any event, the transaction log needs to be fixed. The reason it is so large is that it is not being backed up (or not backed up frequently enough). It is not enough to back up the database; you must also back up the log. We back ours up every 15 minutes, but ours is a highly transactional system and we cannot afford to lose data.
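A minimal example of a log backup (full recovery model assumed; the path and database name are placeholders), which you can then schedule as often as your data-loss tolerance requires:

    BACKUP LOG MyDb
    TO DISK = N'D:\Backups\MyDb_log.trn'
    WITH INIT;  -- overwrites the file; in practice use a unique file name per backup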
If you run multiple DBs on the same SQL Server, do they all fight for the procedure cache? What I am trying to figure out is how SQL Server determines how long to hold onto cached plans. If other DBs are consuming memory, will that impact the procedure cache for a given DB on that same server?
I am finding that some initial page loads within our application are slow, but once the queries are cached it is obviously fast. I'm just not sure how long SQL Server keeps plans in the procedure cache, and whether other DBs will impact that amount of time.
The caching/compiling happens end to end:
- IIS will unload the application after 20 minutes of not being used, by default
- .NET compilation to CLR
- SQL compilation
- loading data into memory
This is why the initial calls take some time.
Generally, stuff stays in cache:
- while it is still in use
- while there is no memory pressure
- while it is still valid (e.g. statistics updates will invalidate cached plans)
If you are concerned, add more RAM. Also note that each database will have different load patterns and SQL Server will juggle memory very well. Unless you don't have enough RAM...
From the documentation:
Execution plans remain in the procedure cache as long as there is enough memory to store them. When memory pressure exists, the Database Engine uses a cost-based approach to determine which execution plans to remove from the procedure cache. To make a cost-based decision, the Database Engine increases and decreases a current cost variable for each execution plan according to the following factors.
This link might also be of interest to you: Most Executed Stored Procedure?
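If you want to see how the plan cache is currently split across databases, here is a rough sketch using the standard DMVs (the dbid comes from the cached SQL text and can be NULL for some ad hoc plans):

    SELECT DB_NAME(st.dbid)                    AS database_name,
           COUNT(*)                            AS cached_plans,
           SUM(cp.size_in_bytes) / 1024 / 1024 AS cache_mb
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
    GROUP BY DB_NAME(st.dbid)
    ORDER BY cache_mb DESC;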
Are there any ways to determine what differences between databases affect an SSIS package's load performance?
I've got a package which loads and does various bits of processing on ~100k records against my laptop database in about 5 minutes.
Running the same package with the same data on the test server, which is a reasonable box in terms of both CPU and memory, it's still going ... about 1 hour so far :-(
I checked the package with a small set of data, and it ran through OK.
I've had similar problems over the past few weeks, and here are several things you could consider, listed in decreasing order of importance according to what made the biggest difference for us:
1. Don't assume anything about the server. We found that our production server's RAID was misconfigured (HP sold us disks with firmware mismatches) and the disk write speed was literally a 50th of what it should be. So check out the server metrics with Perfmon.
2. Check that enough RAM is allocated to SQL Server. Inserts of large datasets often require the use of RAM and TempDB for building indexes, etc. Ensure that SQL Server has enough RAM that it doesn't need to swap out to Pagefile.sys.
3. As per the holy grail of SSIS, avoid manipulating large datasets using T-SQL statements. All T-SQL statements cause changed data to be written to the transaction log, even if you use the Simple recovery model. The main difference between the Simple and Full recovery models is that Simple automatically truncates the inactive log after each checkpoint. This means that large datasets, when manipulated with T-SQL, thrash the log file, killing performance.
4. For large datasets, do data sorts at the source if possible. The SSIS Sort component chokes on reasonably large datasets, and the only viable alternative (nSort by Ordinal, Inc.) costs $900 for a non-transferable per-CPU license. So... if you absolutely have to sort a large dataset, consider loading it into a staging database as an intermediate step.
5. Use the SQL Server Destination if you know your package is going to run on the destination server, since it offers roughly a 15% performance increase over OLE DB because it shares memory with SQL Server.
6. Increase the network packet size to 32767 on your database connection managers. This allows large volumes of data to move faster from the source server(s), and can noticeably improve reads on large datasets.
7. If using Lookup transforms, experiment with cache sizes - using a Cache connection or Full Cache mode for smaller lookup datasets, and Partial / No Cache for larger datasets. This can free up much-needed RAM.
8. If combining multiple large datasets, use either RAW files or a staging database to hold your transformed datasets, then combine and insert all of a table's data in a single data flow operation, and lock the destination table. Using staging tables or RAW files can also help relieve table-locking contention.
9. Last but not least, experiment with the DefaultBufferSize and DefaultBufferMaxRows properties. You'll need to monitor your package's "Buffers Spooled" performance counter using Perfmon.exe, and adjust the buffer sizes upwards until you see buffers being spooled (paged to disk), then back off a little.
Point 8 is especially important on very large datasets, since you can only achieve a minimally logged bulk insert operation if:
- the destination table is empty, and
- the table is locked for the duration of the load operation, and
- the database is in the Simple or Bulk-Logged recovery model.
This means that subsequent bulk loads into a table will always be fully logged, so you want to get as much data as possible into the table on the first load.
Finally, if you can partition your destination table and then load the data into each partition in parallel, you can achieve up to 2.5 times faster load times, though this isn't usually a feasible option out in the wild.
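As a rough T-SQL illustration of the minimally logged pattern from point 8 (table and column names are made up, and the exact minimal-logging rules vary by SQL Server version, so treat this as a sketch; the fast-load options with table lock in the OLE DB / SQL Server Destination aim for the same effect):

    -- Assumes dbo.StagingSales is empty and the DB is in SIMPLE or BULK_LOGGED recovery
    INSERT INTO dbo.StagingSales WITH (TABLOCK)   -- table lock is what allows minimal logging
        (SaleId, SaleDate, Amount)
    SELECT SaleId, SaleDate, Amount
    FROM dbo.SourceSales;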
If you've ruled out network latency, your most likely culprit (with real quantities of data) is your pipeline organisation. Specifically, what transformations you're doing along the pipeline.
Data transformations come in four flavours:
- streaming (entirely in-process/in-memory)
- non-blocking (but still using I/O, e.g. lookup, OLE DB commands)
- semi-blocking (blocks a pipeline partially, but not entirely, e.g. merge join)
- blocking (blocks a pipeline until it's entirely received, e.g. sort, aggregate)
If you have a few blocking transforms, they will significantly hurt your performance on large datasets. Even semi-blocking transforms, on unbalanced inputs, will block for long periods of time.
In my experience the biggest performance factor in SSIS is network latency. A package running locally on the server itself runs much faster than one running anywhere else on the network. Beyond that, I can't think of any reason why the speed would be drastically different. Running SQL Profiler for a few minutes may yield some clues there.
CozyRoc over at MSDN forums pointed me in the right direction ...
- used the SSMS / Management / Activity Monitor and spotted lots of TRANSACTION entries
- got me thinking; read up on the OLE DB connector and unchecked the Table Lock option
- WHAM ... data loads fine :-)
Still don't understand why it works fine on my laptop db but stalls on the test server?
- I was the only person using the test db, so it's not as if there should have been any contention for the tables?
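For anyone hitting the same thing, a quick way to confirm whether the load is being blocked while the package runs (standard DMV, available on 2005 and later):

    -- Sessions that are currently blocked, and what is blocking them
    SELECT session_id, blocking_session_id, wait_type, wait_time, command
    FROM sys.dm_exec_requests
    WHERE blocking_session_id <> 0;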