How delete works in Snowflake partitions - snowflake-cloud-data-platform

I have a question about how delete works in Snowflake. As the micro-partitions are immutable, Snowflake will create new partitions when I delete records (from multiple immutable objects). My question is: what about the remaining empty space that was allotted (16 MB compressed)? Will it be kept as it is, or will the micro-partitions be restructured (re-arranged, defragmented, ...) again?

Delete is just a special case of insert and update, all of which are best seen as "change".
Simon and hkandpal make some very good points about the general life cycle.
At the small end, I and other Stack Overflow users have tested that many small changes to a tiny table still end up with just one partition. So, as Simon mentions, there seems to be some form of appending/rewriting of small partitions.
But at the big end there is very much no free lunch for large changes. What we have noticed when mass deleting is that the output takes fewer partitions, which implies that if you had 50 partitions holding 5,000 rows each and you deleted every odd row, you would end up with 25 partitions. So the write operation is bunching the partitions.
But at the same time, if you have auto-clustering turned on, the delete/update write is unordered, so we have seen huge clustering "re-write" costs after deleting data. The filter used to find the rows to delete reorders the data, that reordered data is what gets written, and auto-clustering can then spend 5x the original write cost re-ordering it. So in some cases it is cheaper to do a CREATE TABLE AS SELECT with an ORDER BY clause to "delete" 1/30th of the data on 100+ GB tables.
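As a rough sketch of that CTAS pattern (table and column names here are invented for the example, not taken from anyone's actual schema):

-- Rebuild the table without the rows to be "deleted", writing in clustering order.
-- big_table, load_date and cluster_key are placeholder names for this sketch.
CREATE OR REPLACE TABLE big_table_new AS
SELECT *
FROM big_table
WHERE load_date >= DATEADD(day, -30, CURRENT_DATE())  -- keep everything except the oldest 1/30th
ORDER BY cluster_key;                                  -- pre-ordered write avoids a later re-clustering pass

ALTER TABLE big_table_new SWAP WITH big_table;         -- swap the rebuilt table into place
DROP TABLE big_table_new;                              -- after the swap this name holds the old data

The single ordered write replaces both the delete and the re-ordering work that auto-clustering would otherwise do afterwards.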
Another fact that points to this lack of a free lunch is that auto-clustering does two tasks: partition defragmentation (which the support engineers have mentioned in the past might be made its own feature) and re-ordering. We have small tables that we rebuild daily to keep the order perfect, because it has a huge impact for us, versus relying on auto-clustering; the latter is happy with "mostly there", but the performance impact of that table being small, in order, and cacheable (it is a table of address information that is joined to almost everything) is meaningful to us.

When a delete or update operation is performed, Snowflake deletes the partition file and replaces it with a new file with the changes.
For example, suppose we have stored the data of a table in two partition files, P1 and P2, and we delete all the records that have the name "Andy". Snowflake then deletes the entire partition file P1 and replaces it with a new partition file P3 that includes the changes.
Before the delete:
Jim - Partition P1
John - Partition P1
Andy - Partition P1
Joe - Partition P2
Mike - Partition P2
Jeff - Partition P2
After the delete (new partition P3 replaces P1):
Jim - Partition P3
John - Partition P3
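For reference, a delete like the following is what triggers that rewrite (the table name is invented for this example); P1 is never modified in place, it is simply replaced by P3 containing only the surviving rows:

-- Hypothetical table matching the example above.
DELETE FROM employees
WHERE name = 'Andy';   -- P1 held Jim, John, Andy; Snowflake writes P3 with Jim and John and retires P1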

Related

Postgres row_number() doubling table size roughly every 24 hours

I have an Assets table with ~165,000 rows in it. However, the Assets make up "Collections" and each Collection may have ~10,000 items, which I want to save a "rank" for so users can see where a given asset ranks within the collection.
The rank can change (based on an internal score), so it needs to be updated periodically (a few times an hour).
That's currently being done on a per-collection basis with this:
UPDATE assets a
SET    rank = a2.seqnum
FROM  (SELECT a2.*,
              row_number() OVER (ORDER BY elo_rating DESC) AS seqnum
       FROM   assets a2
       WHERE  a2.collection_id = #{collection_id}) a2
WHERE  a2.id = a.id;
However, that's causing the size of the table to double (i.e. 1GB to 2GB) roughly every 24 hours.
A VACUUM FULL clears this up, but that doesn't feel like a real solution.
Can the query be adjusted to not create so much (what I assume is) temporary storage?
Running PostgreSQL 13.
Every update writes a new row version in Postgres. So (aside from TOASTed columns) updating every row in the table roughly doubles its size. That's what you observe. Dead tuples can later be cleaned up to shrink the physical size of the table - that's what VACUUM FULL does, expensively. See:
Are TOAST rows written for UPDATEs not changing the TOASTable column?
Alternatively, you might just not run VACUUM FULL and keep the table at ~ twice its minimal physical size. If you run plain VACUUM (without FULL!) often enough - and if you don't have long-running transactions blocking it - Postgres will have marked dead tuples in the free-space map by the time the next UPDATE kicks in and can reuse the disk space, thus staying at ~ twice its minimal size. That's probably cheaper than shrinking and re-growing the table all the time, as the most expensive part is typically to physically grow the table. Be sure to have aggressive autovacuum settings for the table. See:
Aggressive Autovacuum on PostgreSQL
VACUUM returning disk space to operating system
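One possible shape for those aggressive per-table autovacuum settings; the parameters are standard Postgres storage parameters, but the concrete values here are only illustrative and need tuning for the actual workload:

-- Make autovacuum kick in much sooner for this frequently-updated table.
ALTER TABLE assets SET (
  autovacuum_vacuum_scale_factor = 0.02,   -- vacuum after ~2 % of rows are dead instead of the default 20 %
  autovacuum_vacuum_threshold    = 1000
);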
Probably better yet, break out the ranking into a minimal separate 1:1 table (a.k.a. "vertical partitioning"), so that only those minimal rows have to be written "a few times an hour". Probably include the elo_rating you mention in the query, which seems to change at least as frequently (?).
(LEFT) JOIN to the main table in queries. While that adds considerable overhead, it may still be (substantially) cheaper. Depends on the complete picture, most importantly the average row size in table assets and the typical load apart from your costly updates.
See:
Many columns vs few tables - performance wise
UPDATE or INSERT & DELETE? Which is better for storage / performance with large text columns?
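A minimal sketch of what that 1:1 side table could look like, assuming assets has an id primary key as in the query above; the names asset_rank and asset_id are invented for this example:

-- Narrow table holding only the volatile values.
CREATE TABLE asset_rank (
  asset_id   bigint  PRIMARY KEY REFERENCES assets(id),
  elo_rating numeric NOT NULL,
  rank       integer
);

-- Only these narrow rows are rewritten a few times an hour.
UPDATE asset_rank ar
SET    rank = sub.seqnum
FROM  (SELECT ar2.asset_id,
              row_number() OVER (ORDER BY ar2.elo_rating DESC) AS seqnum
       FROM   asset_rank ar2
       JOIN   assets a ON a.id = ar2.asset_id
       WHERE  a.collection_id = #{collection_id}) sub
WHERE  ar.asset_id = sub.asset_id;

-- Queries join back to the wide table.
SELECT a.*, ar.rank
FROM   assets a
LEFT   JOIN asset_rank ar ON ar.asset_id = a.id;

If collection_id is also copied into the side table, the UPDATE no longer has to touch assets at all.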

What is the difference between Micro-partitions and Data Clustering

I was reading this link and trying to understand the relation between the two, but it is quite confusing. Please explain.
Much like Gokhan's answer, but I would describe it differently.
Micro-partitions:
Every time you write data to Snowflake it is written to a new file, because the files are immutable. This means you end up with many fragments. But because Snowflake keeps metadata for tables, when you query, it can prune the micro-partitions known not to contain the data being looked for. Otherwise it loads all data (for the columns you select) and does a brute-force "full table scan".
Snowflake micro-partitions have no relation to classic partitioning, except that, if you are lucky, you might get pruning.
Also, the micro-partitions are written in the order you load the data, and are just "split" into more partitions as your writes go over a threshold, much like what you get from WinZip/7Zip/Gzip with a max file chunk size parameter.
The next thing to note is that if you update a row in a partition, the whole partition is rewritten, AND the order of the updated rows is not controllable and can be random (based on your table join logic).
Thus if you do many writes, or many "small" updates, you will get very bad fragmentation of your partitions, which negatively impacts compile time, as all the metadata needs to be loaded. Which they are now charging for.
This is the way it is because S3 is an immutable file store, but that is also why you can separate compute from data. It is also how "time travel" and "days of data retention" work: the prior state of the table is not deleted for that time, which is why you pay for the S3 storage. This also means watch out for your churn, as you pay for all data written, cumulatively, for those days.
Data-Clustering:
This is a way to indicate how you would like the data to be ordered. Either the legacy manual clustering commands or auto-clustering will then rewrite the partitions to improve the clustering. Think of Norton SpeedDisk (if you are old school).
Writing to the table in the order you want it clustered (aka always have an ORDER BY on your INSERT) will improve things. But a table can only be clustered on one set of "keys", so you need to think about how you mostly use the data before clustering it, or keep multiple copies of the data with the min-sub-set/sort behaviour you need (we do this).
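A rough sketch of those two options in Snowflake SQL; the table sales, the staging table staging_sales, and the key columns are invented for the example:

-- Option 1: declare a clustering key and let auto-clustering maintain the order
-- (this can incur ongoing re-clustering credits).
ALTER TABLE sales CLUSTER BY (sale_date, region);

-- Option 2: write the data pre-ordered yourself, so new partitions land well-clustered.
INSERT INTO sales
SELECT *
FROM staging_sales
ORDER BY sale_date, region;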
Warning: UPDATEs currently do not respect this clustering, and you can pay upwards of 4x the cost of a full table rewrite by having auto-clustering running. You need to watch this, as it is a potentially unbounded cost.
So, in short, clustering is like a poor man's index, and Snowflake is basically a massive full-table-scan/map/reduce engine. But it is really good at that, and once you understand how it works, it is super fun to use.
Snowflake stores data in micro-partitions. Each micro-partition contains between 50 MB and 500 MB of uncompressed data. If you are familiar with partitions in traditional databases, micro-partitions are very similar, but they are automatically generated by Snowflake. You do not need to partition tables as you do in traditional database systems.
Data Clustering is to distribute the data based on a clustering key into these micro-partitions. If clustering is not enabled for your table, your table will still have micro-partitions but the data will not be distributed based on a specific key.
Let's assume column X of our table t has four unique values (A, B, C, D), and we have 5 partitions:
P1: AABBC
P2: ABDAC
P3: BBBCA
P4: CBDCC
P5: BBCCD
If we run the query "SELECT * FROM t WHERE X=A", Snowflake needs to read partitions P1, P2, and P3. If this table is clustered on column X, the data will be distributed like this (in theory):
P1: AAAAA
P2: BBBBB
P3: BBBBC
P4: CCCCC
P5: CCDDD
In this case, when we run "SELECT * FROM t WHERE X=A" query, Snowflake will need to read only P1 partition.
Micro-partitions (or partitioning) are very important when accessing a portion of the data in a large table, because Snowflake can prune partitions based on the filter conditions in your query. If the right key (column) is defined for clustering, the partition pruning will be much more effective.
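If you want to check how well a table is actually clustered on a given column (and therefore how much pruning you can expect), Snowflake exposes this through a system function; shown here for the example table t and column X:

-- Reports average clustering depth and partition-overlap statistics for column X of table t.
SELECT SYSTEM$CLUSTERING_INFORMATION('t', '(X)');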

Optimum number of rows in a table for creating indexes

My understanding is that creating indexes on small tables could be more cost than benefit.
For example, there is no point creating indexes on a table with less than 100 rows (or even 1000 rows?)
Is there any specific number of rows as a threshold for creating indexes?
Update 1
The more I investigate, the more conflicting information I get. I might be too concerned about preserving IO write operations, since my SQL Server database is in HA synchronous-commit mode.
Point #1:
This question is very much about IO write performance. In scenarios like SQL Server HA synchronous-commit mode, the cost of an IO write is high when the database servers reside in data centers on different subnets. Adding indexes adds to that expensive IO write cost.
Point #2:
Books Online suggests:
Indexing small tables may not be optimal because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan. Therefore, indexes on small tables might never be used, but must still be maintained as data in the table changes.
I am not sure adding an index to a table with only one row will ever have any benefit - or am I wrong?
Your understanding is wrong. Small tables also benefit from indexes, especially when they are used to join with bigger tables.
The cost of an index has two parts: storage space and processing time during inserts/updates. The first is very cheap these days, so it can almost be disregarded. Your only real consideration should be tables with a lot of updates and inserts; for those, apply the proper configuration.
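As a small illustration of the join point, an index on a tiny lookup table can still let the optimizer seek into it when it is joined to a large table; the table and column names here are invented for the sketch:

-- Supports joins from a large table on CountryCode and covers CountryName lookups.
CREATE NONCLUSTERED INDEX IX_Country_CountryCode
ON dbo.Country (CountryCode)
INCLUDE (CountryName);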

Delete millions of records using partition tables?

We write about 1 million records per day into a SQL Server table.
Records have insertdate and status fields, among others of course.
I need to delete records from time to time to free space on the volume, but keep the last 4 days of records.
The problem is the deletion takes hours and lots of resources.
I have thought about partitioned tables, setting the partition field to the insertdate, but I have never used that kind of table.
How can I achieve the goal using the least CPU/disk resources possible and with the fewest drawbacks? (I assume any solution has its own drawbacks, but please explain them if you know them.)
There are two approaches you can take to speed up the deletion. One is to delete 10,000 rows at a time so the transaction log does not grow to an enormous size: based on some logic, you keep deleting the top 10,000 rows until all the rows that fulfill the condition are deleted. This can, depending on your system, speed up the deletes by a factor of 100.
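A batched delete loop of that kind might look roughly like this; the table and column names are placeholders based on the question (daily records with an insertdate), not actual code from the answer:

-- Delete in small batches so each transaction, and the log it generates, stays small.
DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM dbo.DailyRecords
    WHERE insertdate < DATEADD(day, -4, CAST(GETDATE() AS date));   -- keep the last 4 days

    SET @rows = @@ROWCOUNT;   -- stop once no qualifying rows remain
END;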
The other approach is to partition the table. You have to create a partition scheme and function, and if all the rows you are deleting are in one partition, let's say a day's worth of sales, then removing that partition deletes all of its rows in a metadata operation and only takes a few seconds. Partitioning is not hard, but you have to spend some time to set up a rolling window properly; it is more than an hour of work but less than a week.
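Once the table is partitioned by day on insertdate, removing a whole day can look like one of the following; the partition number and the staging table name are placeholders for the sketch:

-- SQL Server 2016+: truncate a single partition as a metadata operation.
TRUNCATE TABLE dbo.DailyRecords WITH (PARTITIONS (2));

-- Older versions: switch the partition out to an empty, identically-structured staging
-- table on the same filegroup, then truncate the staging table.
ALTER TABLE dbo.DailyRecords SWITCH PARTITION 2 TO dbo.DailyRecords_Staging;
TRUNCATE TABLE dbo.DailyRecords_Staging;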

choosing table design for database performance

I am developing a job application which executes multiple parallel jobs. Every job pulls data from a third-party source and processes it; the minimum is 100,000 records per job. So I am creating a new table for each job (like Job123, where 123 is the jobId) and processing it. When a job starts, it clears the old records, gets the new records, and processes them. Now the problem is that I have 1000 jobs and the DB has 1000 tables. The DB size has increased drastically due to the large number of tables.
My question is whether it is OK to create a new table for each job, or to have only one table called Job with a jobId column, then load the data into it and process it. The only concern is that every job will have 100,000+ records. If we have only one table, will DB performance be affected?
Please let me know which approach is better.
Don't create all those tables! Even though it might work, there's a huge performance hit.
Having a big table is fine, that's what databases are for. But...I suspect that you don't need 100 million persistent records, do you? It looks like you only process one Job at a time, but it's unclear.
Edit
The database will grow to the largest size needed, but the space from deleted records is reused. If you add 100k records and delete them, over and over, the database won't keep growing. But even after the delete it will take up as much space as 100k records.
I recommend a single large table for all jobs. There should be one table for each kind of thing, not one table for each thing.
If you make the Job ID the first field in the clustered index, SQL Server will use a b-tree index to determine the physical order of data in the table. In principle, the data will automatically be physically grouped by Job ID due to the physical sort order. This may not stay strictly true forever due to fragmentation, but that would affect a multiple table design as well.
The performance impact of making the Job ID the first key field of a large table should be negligible for single-job operations as opposed to having a separate table for each job.
Also, a single large table will generally be more space efficient than multiple tables for the same amount of total data. This will improve performance by reducing pressure on the cache.
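A minimal sketch of that single-table design with the Job ID leading the clustered index; the table and column names and types are assumptions for the example, not taken from the question:

-- One table for all jobs, clustered so each job's rows are stored together.
CREATE TABLE dbo.JobRecord (
    JobId    int           NOT NULL,
    RecordId bigint        NOT NULL,
    Payload  nvarchar(max) NULL,
    CONSTRAINT PK_JobRecord PRIMARY KEY CLUSTERED (JobId, RecordId)
);

-- Clearing a job before a re-run touches only that job's contiguous range,
-- e.g. the data formerly kept in the table Job123.
DELETE FROM dbo.JobRecord WHERE JobId = 123;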
