SQL Server 2014 introduced "memory-optimized tables", whose documentation states that "The primary storage for memory-optimized tables is the main memory. Rows in the table are read from and written to memory. A second copy of the table data is maintained on disk, but only for durability purposes."
This seems to imply a significant performance gain, but doesn't SQL Server have an in-memory buffer cache anyway? If frequent queries are going to use the in-memory buffer cache, why does having an in-memory table provide a significant performance gain?
The "memory-optimized tables" live in memory at all times, ready to be read from the memory at all times. The copy on the disk is like a backup copy just in case if the memory had any issues so it can be reloaded from disk into memory again.
The Buffer cache is used to load data pages from disk into memory, only when there is a request to read those pages and they stay in the memory until the pages are required by other process doing similar request. If the data in memory is no longer required and there is a need to load other pages into memory, those pages already loaded in the memory will get flushed out of memory, until there is a request to read those pages again.
Can you see the difference now? "memory-optimized tables" always live in memory whether someone makes a request to read those pages or not. A standard persisted table will only be cached in memory when someone makes a request for those pages, and will be flushed out if those in memory pages are not being used and there is a need to load more pages in memory.
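As an illustration, here is a minimal sketch of how such a table is declared on SQL Server 2014 or later (the table and column names are made up, and the database is assumed to already have a MEMORY_OPTIMIZED_DATA filegroup):

    -- the whole table lives in memory; DURABILITY = SCHEMA_AND_DATA keeps the on-disk
    -- copy that is only used to rebuild the table in memory after a restart
    CREATE TABLE dbo.SessionCache
    (
        SessionId INT NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
        UserName  NVARCHAR(100) NOT NULL,
        LastSeen  DATETIME2 NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);

With DURABILITY = SCHEMA_ONLY even the disk copy is skipped, so writes never touch disk at all, but the data is lost on restart.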
Hi, as per the link mentioned below, spilling is described as:
"When Snowflake cannot fit an operation in memory, it starts spilling data first to disk, and then to remote storage."
Part#1
- cannot fit an operation in memory: does that mean the memory of the warehouse is too small to handle the workload, so the queries end up in a queued state?
What operations could cause this other than join operations?
Part#2
- it starts spilling data first to disk, and then to remote storage: what is the "disk" referred to in this context? As we know, the warehouse is just a compute unit with no disk in it.
Does this mean the data that can't fit in warehouse memory will spill into the storage layer?
- What is referred to as "remote storage"? Does that mean an internal stage?
Please help me understand disk spilling in Snowflake.
https://community.snowflake.com/s/article/Recognizing-Disk-Spilling
Memory is the compute server memory (which is fastest to access), local storage is an EBS volume attached to the EC2 instance, and remote storage is S3 (slowest to access).
This spilling can have a profound effect on query performance (especially if remote disk is used for spilling). To alleviate this, it's recommended to:
- use a larger warehouse (effectively increasing the available memory/local disk space for the operation), and/or
- process data in smaller batches.
Docs Reference: https://docs.snowflake.com/en/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory
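As a rough sketch (the warehouse name is made up), temporarily resizing the warehouse around the heavy statement looks like this:

    -- give the operation more memory and local disk for the duration of the query
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';
    -- ... run the memory-hungry query ...
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';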
Yes, remote spilling is S3 (local is the local instance cache), and generally when it comes to remote spilling the situation is quite bad and query performance suffers.
Other than rewriting the query, you can always try running it on a larger warehouse, as mentioned in the docs; it will have more cache of its own and spilling should reduce noticeably.
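One way to confirm which queries are spilling (assuming you have access to the SNOWFLAKE.ACCOUNT_USAGE share) is to check the spill columns in QUERY_HISTORY, roughly like this:

    -- queries from the last 7 days that spilled all the way to remote (S3) storage
    SELECT query_id,
           warehouse_size,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND bytes_spilled_to_remote_storage > 0
    ORDER BY bytes_spilled_to_remote_storage DESC;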
Variations of JOIN, such as FLATTEN, that create more rows, and aggregation operations like COUNT DISTINCT.
Just yesterday I was doing some COUNT DISTINCTs over two years' worth of data, with monthly aggregation, and it was spilling to both local and remote storage.
I realized I was doing COUNT(DISTINCT column1, column2) when I wanted COUNT(*), as all those pairs of values were already distinct, and that stopped the remote spill. To avoid some/most of the local spill, I split my SQL into batches of one year in size (the data was clustered on time, so the reads were not wasteful) and inserted the result sets into a table. Lastly, I ran the batches on an extra-large warehouse instead of a medium one.
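Roughly, the rewrite looked like the sketch below (table and column names are invented for illustration; the real schema was different):

    -- before: every (column1, column2) pair has to be tracked for distinctness, which spilled
    SELECT DATE_TRUNC('month', event_time) AS month,
           COUNT(DISTINCT column1, column2) AS cnt
    FROM events
    WHERE event_time >= '2020-01-01' AND event_time < '2022-01-01'
    GROUP BY 1;

    -- after: the pairs were already distinct, so COUNT(*) gives the same answer without the
    -- distinct tracking, and running one year at a time keeps the working set small
    INSERT INTO monthly_counts
    SELECT DATE_TRUNC('month', event_time) AS month,
           COUNT(*) AS cnt
    FROM events
    WHERE event_time >= '2020-01-01' AND event_time < '2021-01-01'
    GROUP BY 1;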
I do not know exactly where the local/remote disk is, but many EC2 instance types come with local disk, so it's possible they use those instances; otherwise it would likely be EBS. I believe remote is S3.
But the moral of the story is: just like a PC using swap memory, it's nice that the operation doesn't instantly fail, but most of the time you would be better off if it did, because how long it's going to take is painful.
I was looking into the concept of in-memory databases. Articles about it say:
An in-memory database system is a database management system that stores data entirely in main memory.
and they discuss advantages and disadvantages of this concept.
My problem is: if these database management systems store data entirely in main memory,
does all the data vanish after a power failure?
Or are there ways to protect the data?
Most in-memory database systems offer persistence, at least as an option. This is implemented through transaction logging. On normal shutdown, an in-memory database image is saved. When it is next re-opened, the previously saved image is loaded and, thereafter, every transaction committed to the in-memory database is also appended to a transaction log file. If the system terminates abnormally, the database can be recovered by re-loading the original database image and replaying the transactions from the transaction log file.
The database is still all in-memory, and therefore there must be enough available system memory to store the entire database, which makes it different from a persistent database for which only a portion is cached in memory. Therefore, the unpredictability of a cache-hit or cache-miss is eliminated.
Appending the transaction to the log file can usually be done synchronously or asynchronously, which will have very different performance characteristics. Asynchronous transaction logging still risks losing committed transactions if they were not flushed from the file system buffers and the system is shut down unexpectedly (e.g. a kernel panic).
In-memory database transaction logging is guaranteed to only ever incur one file I/O to append the transaction to the log file. It doesn't matter if the transaction is large or small, it's still just one write to the persistent media. Further, the writes are always sequential (always appending to the log file), so even on spinning media the performance hit is as small as it can be.
Different media will have greater or lesser impact on performance. HDD will have the greatest, followed by SSD, then memory-tier FLASH (e.g. FusionIO PCIExpress cards) and the least impact coming from NVDIMM memory.
NVDIMM memory can be used to store the in-memory database, or to store the transaction log for recovery. Maximum NVDIMM memory size is less than conventional memory size (and more expensive), but if your in-memory database is some gigabytes in size, this option can retain 100% of the performance of an in-memory database while also providing the same persistence as a conventional database on persistent media.
There are performance comparisons of an in-memory database with transaction logging to HDD, SSD and FusionIO in this whitepaper: http://www.automation.com/pdf_articles/mcobject/McObject_Fast_Durable_Data_Management.pdf
And with NVDIMM in this paper: http://www.odbms.org/wp-content/uploads/2014/06/IMDS-NVDIMM-paper.pdf
The papers were written by us (McObject), but are vendor-neutral.
Reading SQL Server Books Online, my understanding is that the SQL Server buffer pool or "buffer cache" consists of:
a) "Data page cache" -- pages are always fetched from disk into the data page cache, for both read and write operations, if they are not already found in the cache
b) "Plan cache" -- "procedure cache" may not be the appropriate term, as execution plans are cached for ad hoc SQL as well as dynamic SQL
c) Query workspace -- I believe this is used for joins or sorts (ORDER BY)
Question: What else is kept in the buffer pool? Is the "log cache" also part of the buffer pool, or is the caching of log records before they are hardened to the transaction log on disk kept in a separate area of memory?
Check out this http://www.toadworld.com/platforms/sql-server/w/wiki/9729.memory-buffer-cache-and-procedure-cache.aspx
Extract from that blog post:
Other portions of buffer pool include:
System level data structures - holds SQL Server instance level data about databases and locks.
Log cache - reserved for reading and writing transaction log pages.
Connection context - each connection to the instance has a small area of memory to record the current state of the connection. This information includes stored procedure and user-defined function parameters, cursor positions and more.
Stack space - Windows allocates stack space for each thread started by SQL Server.
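If you want to see these areas for yourself, one rough way is to group the memory clerks DMV (available since SQL Server 2005, though the column shown below is the SQL Server 2012+ shape):

    -- buffer pool, plan cache stores, log pool, etc. show up as separate clerks
    SELECT type,
           name,
           SUM(pages_kb) / 1024 AS size_mb   -- on 2008 R2 and older: single_pages_kb + multi_pages_kb
    FROM sys.dm_os_memory_clerks
    GROUP BY type, name
    ORDER BY size_mb DESC;

Clerks such as MEMORYCLERK_SQLBUFFERPOOL (data pages), CACHESTORE_SQLCP / CACHESTORE_OBJCP (plan cache) and MEMORYCLERK_SQLLOGPOOL (log-related caching) roughly correspond to the areas listed above.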
Hope this helps.
This is a somewhat unusual question...
Is there such a thing as too big of an allocation for data and log files for SQL Server?
Please note, that I am NOT talking about running out of space.
Let's assume for the moment that there is infinite storage, but limited I/O throughput. Does the size of the unfilled portions of data and log files the server is accessing matter for performance? For example, if I have a log file for tempdb that only ever fills up to ~5mb, but have a terabyte allocated to it, would the I/O operations accessing this log complete faster if I reduced allocation to 10mb?
No, allocated size will not affect performance. Performance is affected only on file growth.
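What does hurt is growing the files repeatedly in the middle of a workload, so the usual advice is to pre-size them once and set a sensible growth increment rather than relying on small autogrowth steps. A sketch (database and logical file names are placeholders):

    -- pre-size the log and use a fixed growth increment instead of a small percentage
    ALTER DATABASE MyDb
    MODIFY FILE (NAME = MyDb_log, SIZE = 10240MB, FILEGROWTH = 1024MB);

Log files also cannot use instant file initialization, so every log growth has to be zero-initialized, which is another reason the cost shows up at growth time rather than from allocated-but-unused space.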
In one of my applications I have a 1GB database table that is used for reference data. It has a huge amount of reads coming off that table, but there are never any writes. I was wondering if there's any way that data could be loaded into RAM so that it doesn't have to be accessed from disk?
I'm using SQL Server 2005
If you have enough RAM, SQL will do an outstanding job determining what to load into RAM and what to seek on disk.
This question is asked a lot and it reminds me of people trying to manually set which "core" their process will run on -- let the OS (or in this case the DB) do what it was designed for.
If you want to verify that SQL is in fact reading your look-up data out of cache, then you can initiate a load test and use Sysinternals FileMon, Process Explorer and Process Monitor to verify that the 1GB table is not being read from disk. For this reason, we sometimes put our "lookup" data onto a separate filegroup so that it is very easy to monitor when it is being accessed on disk.
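If you'd rather check from inside SQL Server, a rough sketch against sys.dm_os_buffer_descriptors (available from SQL Server 2005; the join below is the common simplification) shows how much of each table in the current database is sitting in the buffer cache:

    -- cached pages per object in the current database (8 KB pages converted to MB)
    SELECT OBJECT_NAME(p.object_id) AS table_name,
           COUNT(*) * 8 / 1024 AS cached_mb
    FROM sys.dm_os_buffer_descriptors AS bd
    JOIN sys.allocation_units AS au
        ON bd.allocation_unit_id = au.allocation_unit_id
    JOIN sys.partitions AS p
        ON au.container_id = p.hobt_id   -- simplification: exact only for in-row/row-overflow data
    WHERE bd.database_id = DB_ID()
    GROUP BY p.object_id
    ORDER BY cached_mb DESC;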
Hope this helps.
You're going to want to take a look at memcached. It's what a lot of huge (and well-scaled) sites use to handle problems just like this. If you have a few spare servers, you can easily set them up to keep most of your data in memory.
http://en.wikipedia.org/wiki/Memcached
http://www.danga.com/memcached/
http://www.socialtext.net/memcached/
Just to clarify the issue for SQL Server 2005 and up, the DBCC PINTABLE documentation says:
This functionality was introduced for performance in SQL Server version 6.5. DBCC PINTABLE has highly unwanted side-effects. These include the potential to damage the buffer pool. DBCC PINTABLE is not required and has been removed to prevent additional problems. The syntax for this command still works but does not affect the server.
DBCC PINTABLE will explicitly pin a table in core if you want to make sure it remains cached.
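For reference, the old invocation looked like the sketch below (database and table names are placeholders); per the quote above, on SQL Server 2005 and later the command still parses but does nothing, so the buffer manager alone decides what stays cached.

    DECLARE @db_id int, @object_id int;
    SET @db_id = DB_ID('MyDatabase');
    SET @object_id = OBJECT_ID('MyDatabase.dbo.ReferenceTable');
    DBCC PINTABLE (@db_id, @object_id);   -- a no-op on SQL Server 2005 and later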