What does disk spilling mean and how can it be avoided in Snowflake? - snowflake-cloud-data-platform

Hi, as per the link mentioned below,
spilling is described as "When Snowflake cannot fit an operation in memory, it starts spilling data first to disk, and then to remote storage."
Part#1
- cannot fit an operation in memory: does this mean the memory of the warehouse is too small to handle the workload and the queries are getting into a queued state?
What operations other than joins could cause this?
Part#2
- it starts spilling data first to disk, and then to remote storage: what is the "disk" referred to in this context? As we know, a warehouse is just a compute unit with no disk in it.
Does this mean the data that can't fit in warehouse memory will spill into the storage layer?
- What is referred to as "remote storage"? Does that mean an internal stage?
Please help me understand disk spilling in Snowflake.
https://community.snowflake.com/s/article/Recognizing-Disk-Spilling

Memory is the compute server's memory (which is fastest to access), local storage is the EBS volume attached to the EC2 instance, and remote storage is S3 (slowest to access).
This spilling can have a profound effect on query performance (especially if remote disk is used for spilling). To alleviate it, it's recommended to:
use a larger warehouse (effectively increasing the available memory/local disk space for the operation), and/or process the data in smaller batches.
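As a minimal sketch of those two recommendations in Snowflake SQL (the warehouse, table, and column names are hypothetical), assuming you are allowed to resize the warehouse:

    -- Option 1: temporarily size the warehouse up for the heavy query
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
    -- ... run the spilling query ...
    ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';

    -- Option 2: process the data in smaller batches, e.g. one month at a time
    INSERT INTO results_table
    SELECT some_key, COUNT(*) AS cnt
    FROM big_table
    WHERE event_date >= '2022-01-01' AND event_date < '2022-02-01'
    GROUP BY some_key;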
Docs Reference: https://docs.snowflake.com/en/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory

Yes, remote spilling is to S3 (local is the local instance cache) - and generally, by the time things come to remote spilling the situation is quite bad and query performance is suffering.
Other than rewriting the query, you can always try running it on a larger warehouse, as mentioned in the docs - it will have more cache of its own and spilling should reduce noticeably.
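A way to check how much a query actually spilled is to look at the spill columns in the account usage query history; a hedged example (the 24-hour window and the remote-spill filter are just illustrative):

    -- Requires access to the SNOWFLAKE.ACCOUNT_USAGE share
    SELECT query_id,
           warehouse_size,
           bytes_spilled_to_local_storage,
           bytes_spilled_to_remote_storage
    FROM snowflake.account_usage.query_history
    WHERE start_time > DATEADD('hour', -24, CURRENT_TIMESTAMP())
      AND bytes_spilled_to_remote_storage > 0
    ORDER BY bytes_spilled_to_remote_storage DESC;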

Variations of JOIN, and operations like FLATTEN that create more rows, can cause this, as can aggregation operations like COUNT DISTINCT.
Just yesterday I was doing some COUNT DISTINCTs over two years' worth of data, with monthly aggregation, and it was spilling, both to local and to remote.
I realized I was doing COUNT(DISTINCT column1, column2) when I wanted COUNT(*), as all those pairs of values were already distinct, and that stopped the remote spill. To avoid some/most of the local spill I split my SQL into batches of one year in size (the data was clustered on time, so the reads were not wasteful) and inserted the result sets into a table. Lastly, I ran the batches on an extra-large warehouse instead of a medium.
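To illustrate that kind of rewrite and batching (table and column names are made up for the example, and the COUNT(*) version only gives the same answer because the pairs were already distinct within each group):

    -- Before: counting pairs that are already unique within each month
    SELECT DATE_TRUNC('month', event_time) AS month,
           COUNT(DISTINCT column1, column2) AS cnt
    FROM my_table
    GROUP BY 1;

    -- After: COUNT(*) instead, run one one-year batch at a time
    INSERT INTO monthly_counts
    SELECT DATE_TRUNC('month', event_time) AS month,
           COUNT(*) AS cnt
    FROM my_table
    WHERE event_time >= '2022-01-01' AND event_time < '2023-01-01'
    GROUP BY 1;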
I do not know the exact answer as to where the local/remote disk is, but many EC2 instance types come with local disk, so it's possible they use those instances; otherwise it would likely be EBS. I believe remote is S3.
But the moral of the story is: just like a PC using swap, it's nice that the operation doesn't instantly fail, but most of the time you'd be better off if it did, because how long it's going to take is painful.

Related

PostgreSQL RDS running out of Free Storage Space while querying

We have a read-only PostgreSQL RDS database which is heavily queried. We don't perform any inserts/updates/deletes during normal traffic, but we can still see that we are running out of Free Storage Space, along with an increase in the Write IOPS metric. During this period, CPU usage is at 100%.
At some point, the storage space seems to be released.
Is this expected?
The issue was in the end related to our logs. log_statement was set to all, so every single query to PG would be logged. In order to troubleshoot long-running queries, we combined log_statement and log_min_duration_statement.
Since this is a read-only database we want to know if there is any insert/update/delete operation, so log_statement: ddl; and we want to know which queries are taking longer than 1s: log_min_duration_statement: 1000
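A minimal sketch of those settings as they would be applied on a self-managed PostgreSQL instance (on RDS the same parameter names go into a DB parameter group rather than ALTER SYSTEM):

    ALTER SYSTEM SET log_statement = 'ddl';               -- note: 'mod' would additionally log INSERT/UPDATE/DELETE
    ALTER SYSTEM SET log_min_duration_statement = 1000;   -- log statements running longer than 1000 ms
    SELECT pg_reload_conf();                              -- pick up the changes without a restart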

Fastest throughput local, RAM-cached DB

I'm looking for a DB solution for a high performance application.
The database will need to be local and stored in RAM for performance and will be several GB in size.
It will be local to the application, but it may be accessed by multiple processes running on the machine (up to 40). The data in the DB is immutable once it's been inserted and I only need a basic key value store rather than anything relational.
The obvious candidates are Memcached and Redis, but I believe they both have limitations with overhead and bottlenecks from the network component.
Something like Berkeley DB would also appear to be ideal, but it's only single process as far as I can see.
Throughput is the most important consideration (more so than latency).

Shared volume for data (multiple MDF) and another shared volume for logs (multiple LDF) on SAN

I have 3 instances of SQL Server 2008, each on different machines with multiple databases on each instance. I have 2 separate LUNs on my SAN for MDF and LDF files. The NDX and TempDB files run on the local drive on each machine. Is it O.K. for the 3 instances to share the same volume for the data files and another volume for the log files?
I don't have thin provisioning on the SAN, so I would like to avoid constraining disk space by creating multiple volumes, because I was advised that I should create a volume (drive letter) for each instance, if not for each database. I am aware that I should at least split my log and data files. No instance would share the actual database files, just the space on the drive.
Any help is appreciated.
Of course the answer is: "It depends". I can try to give you some hints on what it depends however.
A SQL Server instance "assumes" that it has exclusive access to its resources. So it will fill all available RAM by default, it will use all CPUs, and it will try to saturate the I/O channels to get maximum performance. That's the reason for the general advice to keep your instances from concurrently accessing the same disks.
Another thing is that SQL Server "knows" that sequential I/O access gives you much higher throughput than random I/O, so there are a lot of mechanisms at work (like logfile organization, read-ahead, the lazy writer and others) to avoid random I/O as much as possible.
Now, if three instances of SQL Server do sequential I/O requests on a single volume at the same time, then from the perspective of the volume you are getting random I/O requests again, which hurts your performance.
That being said, it is only a problem if your I/O subsystem is a significant bottleneck. If your logfile volume is fast enough that the intermingled sequential writes from the instances don't create a problem, then go ahead. If you have enough RAM on the instances that data reads can be satisfied from the buffer cache most of the time, you don't need much read performance on your I/O subsystem.
What you should avoid in any case is multiple growth steps on either log or data files. If several files on one filesystem are growing, you will get fragmentation, and fragmentation can turn a sequential read or write request, even from a single source, back into random I/O.
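As a hedged illustration of avoiding repeated growth steps (the database and logical file names are hypothetical), you could pre-size the files and use a large fixed growth increment:

    -- Pre-size data and log files so autogrowth rarely kicks in,
    -- and use a large fixed increment instead of many small growth steps
    ALTER DATABASE MyDb MODIFY FILE (NAME = MyDb_data, SIZE = 50GB, FILEGROWTH = 5GB);
    ALTER DATABASE MyDb MODIFY FILE (NAME = MyDb_log,  SIZE = 10GB, FILEGROWTH = 2GB);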
The whole picture changes again if you use SSDs as disks. These have totally different requirements and behaviour, but since you didn't say anything about SSD I will assume that you use a "conventional" disk-based array or RAID configuration.
Short summary: You might get away with it, if the circumstances are right, but it is hard to assess without knowing a lot more about your systems, from both the SAN and SQL perspective.

SQL Server CPU vs. Storage Bottlenecking

I've read quite a bit about SQL Servers using SSDs performing much better than traditional hard drives. In load tests with my app in a test environment, though, I'm able to keep my test DB server (SQL 2005) pegged between 75% and 100% CPU usage without much of a strain on disk access (as far as I can tell). My data set is still pretty small; database backups are under 100 MB. The test server I'm using is not new, but is also no slouch.
So, my questions:
Is the CPU the bottleneck (as opposed to the storage) because the dataset is small and therefore fits in memory?
Will this change once the dataset grows so paging is necessary?
Approximately how big (as a percentage of system memory) does the dataset have to get before SQL Server starts paging? Or does that depend on a lot of other factors?
As the app and its dataset grows, are there other bottlenecks that will tend to crop up besides CPU, storage, and lack of proper indexes?
Yes
Yes
If you have SQL Server configured to use as much memory as it can get, probably when it exceeds the max system memory. But what causes paging is very setup-dependent (the query that is being executed is the most prevalent cause); see the sketch after these answers.
I/O between the requesting machine and the server is the only other one I can think of, and that only matters if you are retrieving large datasets. I also would not group a lack of indexes under bottlenecks; rather, indexes enable better performance with regard to searching.
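A hedged sketch of the memory configuration mentioned above (the 8192 MB cap is just an example value):

    -- Check and cap how much memory SQL Server is allowed to take
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'max server memory (MB)';         -- show the current setting
    EXEC sp_configure 'max server memory (MB)', 8192;   -- cap at 8 GB, for example
    RECONFIGURE;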
As long as the CPU is the bottleneck on your dedicated SQL-Server machine, you don't have to worry about disk speed (assuming nothing's wrong with the machine). SQL-Server WILL use heavy memory caching. SQL-Server has built-in strategies to perform best under a given load and available resources. Just don't worry about it!

Tracking down data load performance issues in SSIS package

Are there any ways to determine what differences between databases affect SSIS package load performance?
I've got a package which loads and does various bits of processing on ~100k records; on my laptop database it takes about 5 minutes.
Trying the same package and the same data on the test server, which is a reasonable box in terms of both CPU and memory, it's still running ... about 1 hour so far :-(
I checked the package with a small set of data, and it ran through OK.
I've had similar problems over the past few weeks, and here are several things you could consider, listed in decreasing order of importance according to what made the biggest difference for us:
Don't assume anything about the server.
We found that our production server's RAID was misconfigured (HP sold us disks with firmware mismatches) and the disk write speed was literally a 50th of what it should have been. So check out the server metrics with Perfmon.
Check that enough RAM is allocated to SQL Server. Inserts of large datasets often require use of RAM and TempDB for building indices, etc. Ensure that SQL has enough RAM that it doesn't need to swap out to Pagefile.sys.
As per the holy grail of SSIS, avoid manipulating large datasets using T-SQL statements. All T-SQL statements cause changed data to be written to the transaction log, even if you use the Simple recovery model. The main difference between the Simple and Full recovery models here is that Simple automatically truncates the log at each checkpoint rather than waiting for a log backup. This means that large datasets, when manipulated with T-SQL, thrash the log file, killing performance (see the log-space check after this list).
For large datasets, do data sorts at the source if possible. The SSIS Sort component chokes on reasonably large datasets, and the only viable alternative (nSort by Ordinal, Inc.) costs $900 for a non-transferable per-CPU license. So... if you absolutely have to sort a large dataset, consider loading it into a staging database as an intermediate step.
Use the SQL Server Destination if you know your package is going to run on the destination server, since it offers roughly 15% performance increase over OLE DB because it shares memory with SQL Server.
Increase the network packet size to 32767 on your database connection managers. This allows large volumes of data to move faster from the source server(s), and can noticeably improve reads on large datasets.
If using Lookup transforms, experiment with cache sizes - between using a Cache connection or Full Cache mode for smaller lookup datasets, and Partial / No Cache for larger datasets. This can free up much needed RAM.
If combining multiple large datasets, use either RAW files or a staging database to hold your transformed datasets, then combine and insert all of a table's data in a single data flow operation, and lock the destination table. Using staging tables or RAW files can also help relieve table locking contention.
Last but not least, experiment with the DefaultBufferSize and DefaultBufferMaxRows properties. You'll need to monitor your package's "Buffers Spooled" performance counter using Perfmon.exe, and adjust the buffer sizes upwards until you see buffers being spooled (paged to disk), then back off a little.
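On the log-thrashing point above, a hedged way to watch the recovery model and log usage while such a T-SQL step runs (standard commands, nothing SSIS-specific):

    -- Which recovery model each database is in
    SELECT name, recovery_model_desc FROM sys.databases;

    -- Log size and percentage used for every database
    DBCC SQLPERF(LOGSPACE);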
The point about locking the destination table is especially important on very large datasets, since you can only achieve a minimally logged bulk insert operation if:
The destination table is empty, and
The table is locked for the duration of the load operation.
The database is in the Simple / Bulk Logged recovery model.
This means that subsequent bulk loads into a table will always be fully logged, so you want to get as much data as possible into the table on the first data load.
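To make those conditions concrete, here is a hedged T-SQL sketch of the same idea (database and table names are made up); inside SSIS the equivalent is a fast-load destination with the Table Lock option checked:

    -- Bulk-Logged (or Simple) recovery model is one of the prerequisites
    ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

    -- Destination must be empty, and the load must hold a table lock
    TRUNCATE TABLE dbo.DestTable;
    INSERT INTO dbo.DestTable WITH (TABLOCK)
    SELECT col1, col2
    FROM dbo.StagingTable;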
Finally, if you can partition your destination table and then load the data into each partition in parallel, you can achieve up to 2.5 times faster load times, though this isn't usually a feasible option out in the wild.
If you've ruled out network latency, your most likely culprit (with real quantities of data) is your pipeline organisation. Specifically, what transformations you're doing along the pipeline.
Data transformations come in four flavours:
streaming (entirely in-process/in-memory)
non-blocking (but still using I/O, e.g. lookup, oledb commands)
semi-blocking (blocks a pipeline partially, but not entirely, e.g. merge join)
blocking (blocks a pipeline until it's entirely received, e.g. sort, aggregate)
If you have a few blocking transforms, they will significantly hurt your performance on large datasets. Even semi-blocking transforms, on unbalanced inputs, will block for long periods of time.
In my experience the biggest performance factor in SSIS is Network Latency. A package running locally on the server itself runs much faster than anything else on the network. Beyond that I can't think of any reasons why the speed would be drastically different. Running SQL Profiler for a few minutes may yield some clues there.
CozyRoc over at MSDN forums pointed me in the right direction ...
- used SSMS / Management / Activity Monitor and spotted lots of TRANSACTION entries
- that got me thinking; I read up on the OLE DB connector and unchecked the Table Lock option
- WHAM ... data loads fine :-)
Still don't understand why it works fine on my laptop DB and stalls on the test server?
- I was the only person using the test DB, so it's not as if there should have been any contention for the tables??
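If you hit something similar, a hedged sketch for checking whether sessions are actually blocked while the load runs (a standard DMV query, nothing package-specific):

    -- Shows requests that are currently waiting on another session
    SELECT session_id,
           blocking_session_id,
           wait_type,
           wait_time,
           command
    FROM sys.dm_exec_requests
    WHERE blocking_session_id <> 0;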
