What does disk spilling mean, and how can it be avoided in Snowflake? - snowflake-cloud-data-platform

Hi, as per the link mentioned below, spilling is described as "When Snowflake cannot fit an operation in memory, it starts spilling data first to disk, and then to remote storage."
- "cannot fit an operation in memory": does that mean the warehouse's memory is too small to handle the workload and the queries are going into a queued state?
What operations could cause this, other than a join?
- "it starts spilling data first to disk, and then to remote storage": what does "disk" refer to in this context? As far as we know, a warehouse is just a compute unit with no disk in it.
Does this mean that data which can't fit in warehouse memory will spill into the storage layer?
- What is meant by "remote storage"? Does that mean an internal stage?
Please help me understand disk spilling in Snowflake.

Memory is the compute server's memory (the fastest to access), local storage is the EBS volume attached to the EC2 instance, and remote storage is S3 (the slowest to access).
Spilling can have a profound effect on query performance, especially if remote storage is used for spilling. To alleviate this, the docs recommend:
- using a larger warehouse (effectively increasing the available memory and local disk space for the operation), and/or
- processing data in smaller batches.
Docs Reference: https://docs.snowflake.com/en/user-guide/ui-query-profile.html#queries-too-large-to-fit-in-memory
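
To find out which queries are actually spilling, one option is to look at the spill columns in the account usage query history. This is a minimal sketch, assuming your role can read the SNOWFLAKE.ACCOUNT_USAGE schema (that view is populated with some latency); the 7-day window and the LIMIT are arbitrary choices for illustration.

```sql
-- List recent queries that spilled, worst remote spillers first.
-- Assumption: the current role has access to SNOWFLAKE.ACCOUNT_USAGE.
SELECT query_id,
       warehouse_name,
       warehouse_size,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       total_elapsed_time / 1000 AS elapsed_seconds  -- total_elapsed_time is in milliseconds
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0
       OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC,
         bytes_spilled_to_local_storage DESC
LIMIT 20;
```

Queries with a non-zero bytes_spilled_to_remote_storage are usually the first candidates for a rewrite or a larger warehouse.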

Yes, remote spilling goes to S3 (local spilling goes to the local instance cache). Generally, once a query gets to remote spilling, the situation is quite bad and its performance suffers.
Other than rewriting the query, you can always try running it on a larger warehouse, as mentioned in the docs. It will have more cache of its own, and spilling should drop noticeably.
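
As a rough sketch of the "larger warehouse" route: a warehouse can be resized before the heavy query and sized back down afterwards. The warehouse name my_wh and the sizes are placeholders, not taken from the question.

```sql
-- Assumption: my_wh is a hypothetical warehouse name; pick sizes that fit your budget.
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- ... run the query that was spilling ...

-- Size back down so the larger warehouse is not billed longer than needed.
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'MEDIUM';
```

Queries that are already running when the resize happens keep using the old resources, so it is simplest to resize before submitting the heavy query.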

Besides JOIN, operations that create more rows, such as FLATTEN, and aggregation operations, such as COUNT DISTINCT, can cause spilling.
Just yesterday I was doing some COUNT DISTINCTs over two years' worth of data, with monthly aggregation, and it was spilling to both local and remote storage.
I realized I was doing COUNT(DISTINCT column1,column2) when what I wanted was COUNT(*), as all those pairs of values were already distinct, and fixing that stopped the remote spill. To avoid some or most of the local spill, I split my SQL into batches of 1 year each (the data was clustered on time, so the reads were not wasteful) and inserted the result sets into a table. Lastly, I ran the batches on an extra-large warehouse instead of a medium one.
I do not know exactly where the local/remote disk is, but many EC2 instance types come with local disk, so it's possible they use those instances; otherwise it would likely be EBS. I believe remote is S3.
But the moral of the story is that this is just like a PC using swap memory: it's nice that the operation does not fail instantly, but most of the time you would be better off if it did, because the time it ends up taking is painful.
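
To make those two changes concrete, here is a sketch of what they could look like. The table names (events, monthly_counts), the event_time column and the date ranges are made up for illustration; only column1 and column2 come from my actual query.

```sql
-- Before: the DISTINCT forces the engine to track every (column1, column2) pair.
SELECT DATE_TRUNC('month', event_time) AS month,
       COUNT(DISTINCT column1, column2) AS n
FROM events
GROUP BY 1;

-- After: safe only if the pairs are already unique.
-- Note: COUNT(DISTINCT a, b) skips rows where a or b is NULL, while COUNT(*) counts them.
-- Each statement covers one year; results are appended to a (hypothetical) results table.
INSERT INTO monthly_counts (month, n)
SELECT DATE_TRUNC('month', event_time), COUNT(*)
FROM events
WHERE event_time >= '2021-01-01' AND event_time < '2022-01-01'
GROUP BY 1;

INSERT INTO monthly_counts (month, n)
SELECT DATE_TRUNC('month', event_time), COUNT(*)
FROM events
WHERE event_time >= '2022-01-01' AND event_time < '2023-01-01'
GROUP BY 1;
```

Batching on the clustering column means each batch only reads its own slice of the table, which is why the extra statements did not add wasted scanning.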


