What is the difference between Micro-partitions and Data Clustering - snowflake-cloud-data-platform

I was reading this link and trying to understand the relation between the two; it is quite confusing. Please explain.

Much like Gokhan's answer, but I would describe it differently.
Micro-partitions:
Every time you write data to Snowflake, it's written to a new file, because the files are immutable. This means you have many fragments. But because Snowflake keeps metadata for its tables, when you query, it can prune the micro-partitions known not to contain the data being looked for. Otherwise it loads all data (for the columns you select) and does a brute-force "full table scan".
Snowflake micro-partitions have no relation to classic partitioning, except that, if you are lucky, you might get pruning.
Also, the micro-partitions are written in the order you load the data, and are just "split" into more partitions as your writes go over a size threshold, much like what you get from WinZip/7-Zip/Gzip with a maximum file chunk size parameter.
The next thing to note is that if you update a ROW in a partition, the whole partition is rewritten. AND the order of the updated rows is not controllable and can be random (based on your table join logic).
Thus if you do many writes, or many "small" updates, you will have very bad fragmentation of your partitions, which very negatively impacts compile time, as all the metadata needs to be loaded - which they are now charging for.
It works this way because S3 is an immutable file store, but this is also why you can separate compute from storage. It's also how "time travel" and the "data retention days" setting work: the prior state of the table is not deleted for that period, which is why you pay for the S3 storage. It also means you should watch your churn, because you pay for all data written, cumulatively, over those days.
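As a minimal sketch of the time travel and retention settings mentioned above (the table name t is just a placeholder):

-- Query the table as it looked one hour ago (Time Travel)
SELECT * FROM t AT(OFFSET => -3600);

-- Reduce how many days of prior versions are retained (and paid for)
ALTER TABLE t SET DATA_RETENTION_TIME_IN_DAYS = 1;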
Data-Clustering:
Is a way to indicate how you would like the data to be ordered. Either the legacy manual recluster commands or auto-clustering will rewrite the partitions to improve the clustering. Think of Norton SpeedDisk (if you are old school).
Writing to the table in the order you want it clustered (i.e. always have an ORDER BY on your INSERT) will improve things. But a table can only be clustered on one set of "keys", so you need to think about how you will mostly use the data before clustering it, or keep multiple copies of the data, each with the minimal subset/sort behaviour you need (we do this).
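As a rough illustration of that pattern (the table and column names here are made up, not from the question):

-- Declare the desired ordering as a clustering key
CREATE TABLE sales (event_date DATE, customer_id NUMBER, amount NUMBER)
  CLUSTER BY (event_date);

-- Load in (roughly) that order so the micro-partitions start out well clustered
INSERT INTO sales
  SELECT event_date, customer_id, amount
  FROM staging_sales
  ORDER BY event_date;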
Warning: UPDATEs currently do not respect this clustering, and you can pay upwards of 4x the cost of a full table rewrite by having auto-clustering running. You need to watch this, as it is a potentially unbounded cost.
So in short, clustering is like a poor man's index, and Snowflake is basically a massive full-table-scan/map/reduce engine. But it's really good at that, and once you understand how it works, it's super fun to use.

Snowflake stores data in micro-partitions. Each micro-partition contains between 50 MB and 500 MB of uncompressed data. If you are familiar with partitions in traditional databases, micro-partitions are very similar, but they are generated automatically by Snowflake. You do not need to partition tables yourself as you do in traditional database systems.
Data clustering distributes the data into these micro-partitions based on a clustering key. If clustering is not enabled for your table, the table will still have micro-partitions, but the data will not be distributed based on a specific key.
Let's assume the X column of our table (t) has the unique values A, B, C, and D, and we have 5 partitions:
P1: AABBC
P2: ABDAC
P3: BBBCA
P4: CBDCC
P5: BBCCD
If we run the query "SELECT * FROM t WHERE X=A", Snowflake needs to read partitions P1, P2, and P3. If this table is clustered on the X column, the data will be distributed like this (in theory):
P1: AAAAA
P2: BBBBB
P3: BBBBC
P4: CCCCC
P5: CCDDD
In this case, when we run the query "SELECT * FROM t WHERE X=A", Snowflake will need to read only the P1 partition.
Micro-partitions (or partitioning in general) are very important when accessing a portion of the data in a large table, because Snowflake can prune partitions based on the filter conditions in your query. If the right key (column) is defined for clustering, partition pruning becomes much more effective.
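To make that concrete, a small sketch of how you could define a clustering key and check how well the table is clustered, reusing the t and X names from the example above:

-- Define (or change) the clustering key on column X
ALTER TABLE t CLUSTER BY (X);

-- Report clustering depth/overlap statistics for t on X
SELECT SYSTEM$CLUSTERING_INFORMATION('t', '(X)');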

Related

Snowflake - Clustering

What is the best approach for clustering snowflake tables
Absolute clustering by manually reloading the tables at a certain frequency based on retrieval order
Create a cluster key and turn on auto-recluster, but keep it suspended most of the time and run it only at certain intervals, perhaps by looking at the partitions-scanned column for the table
Thanks
Rajib
There is no general answer that applies across all data-usage patterns, or even across time, as the clustering implementation is evolving (said as an outsider, but one watching it change over time).
Auto-clustering is just like hard-drive defragmentation: both are the same idea of locating like data near like data to make read performance better. And just as with disk defragmentation, different usage loads/patterns make the need for clustering/defragmentation more or less important, and some usage patterns conflict with auto-clustering.
For example, we have some tables that are written to in as tight a loop as we can manage, and we want them clustered in a pattern that is 90% aligned with the insert order, so the auto-clustering is not costly relative to the insert pattern. But once a month we delete from these tables for GDPR/PII reasons, and that update/delete changes about 1/3 of the partitions. It would seem that doing a full table rewrite with an ORDER BY applied would be overkill, but because of the insert rate, auto-clustering (as it stands today) thrashes for hours and costs 5x what a full table rewrite would.
We also have other tables (they contain address information) that are "rather small" and get full-table-scanned a lot, so ordering them in the auto-clustering sense does not make sense; instead we rebuild those tables daily to keep the number of partitions as small as possible, so full table scans are as fast as they can be. The point being that auto-clustering also does micro-partition optimization, which would be useful, but we don't need the table ordered, so we are not running clustering.
Your best method is to create the initial table sorted by your cluster key, and then turn on autoclustering...and then let Snowflake handle everything for you from there.
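In Snowflake terms, that could look roughly like this (the table and column names are placeholders):

-- Build the table already sorted on the intended cluster key
CREATE OR REPLACE TABLE sales CLUSTER BY (business_date) AS
  SELECT * FROM sales_staging ORDER BY business_date;

-- Auto-clustering runs by default once a cluster key exists;
-- resume it explicitly if it was previously suspended
ALTER TABLE sales RESUME RECLUSTER;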
To cut to the chase with the answers:
Load the tables sorted on the date/time field that is most likely to be used to retrieve the data - the business date rather than the (ETL) insert date/time. This should be good enough for most tables from a data-retrieval performance point of view.
You can choose to re-cluster depending on the rate of DML operations on the table.
If you have an additional data-access pattern on specific columns, you may consider adding a clustering key to the table and letting auto-clustering kick in.
It is always desirable to identify the access pattern sooner rather than later. Given that, auto-clustering will rearrange the data to make sure you achieve good retrieval performance.
Auto-clustering will cost you credits, but that is usually outweighed by the performance you gain.
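One way (among others) to keep an eye on that cost, and on whether pruning is actually improving, is to check the clustering and query history views; a rough sketch (the table name 'SALES' is a placeholder):

-- Credits spent by auto-clustering on a table over the last 7 days
SELECT *
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
    DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
    TABLE_NAME => 'SALES'));

-- Compare partitions scanned vs. total to see how well queries prune
SELECT query_text, partitions_scanned, partitions_total
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
ORDER BY start_time DESC
LIMIT 20;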
Link here will help you make an informed decision.
Hope this helps!

Performance of Column Family in Cassandra DB

I have a table where my queries will be based purely on the id and created_time. There are 50 other columns that will also be queried purely by id and created_time. I can design it in two ways:
Either multiple small tables with 5 columns each, covering all 50 parameters
Or a single table with all 50 columns, with id and created_at as the primary key
Which will be better? My row count will grow tremendously, so should I worry about the size of the column family while modelling?
Actually, you need small tables to decrease the load on a single table, and you should also try to maintain query-based tables. If your queries read all 50 columns, then you can proceed with a single table. But if you plan to read only part of the data in each query, then you should maintain query-based small tables, which will redistribute the data evenly across the nodes, or maintain multiple partitions as Alex suggested (but then you cannot do range-based queries).
This really depends on how you structure your partition key and how data is distributed inside a partition. CQL has some limits, such as a maximum of 2 billion cells per partition, but that is a theoretical limit; the practical limits are things like not having partitions bigger than 100 MB, etc. (DSE has recommendations in its planning guide).
If you will always search by id and created_time, and will not do range queries on created_time, then you may even use a composite partition key comprising both - this will distribute data more evenly across the cluster. Otherwise, make sure that you don't have too much data inside a partition.
Or you can add another piece to the partition key; for example, sometimes people add a truncated date-time to the partition key, such as the time rounded to the hour or to the day - but this will affect your queries, so it really depends on them.
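A minimal CQL sketch of that bucketed partition key idea (the table and the day bucket are illustrative, not from the question):

-- One partition per (id, day) bucket; rows ordered by created_time within it
CREATE TABLE readings (
    id uuid,
    day date,
    created_time timestamp,
    value text,
    PRIMARY KEY ((id, day), created_time)
);

-- Queries must then supply both parts of the partition key
SELECT * FROM readings
WHERE id = 123e4567-e89b-12d3-a456-426614174000 AND day = '2018-03-01';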
Sort of in line with what Alex mentions, the determining factor here is going to be the size of your various partitions (which is an extension of the size of your columns).
Practically speaking, you can have problems going both ways - partitions that are too narrow can be as problematic as partitions that are too wide, so this is the type of thing you may want to try benchmarking and seeing which works best. I suspect for normal data models (staying away from the pathological edge cases), either will work just fine, and you won't see a meaningful difference (assuming 3.11).
In 3.11.x, Cassandra does a better job of skipping unrequested values than in 3.0.x, so if you do choose to join it all in one table, do consider using 3.11.2 or whatever the latest available release is in the 3.11 (or newer) branch.

Optimum number of rows in a table for creating indexes

My understanding is that creating indexes on small tables could cost more than it benefits.
For example, there is no point creating indexes on a table with fewer than 100 rows (or even 1,000 rows?).
Is there any specific number of rows as a threshold for creating indexes?
Update 1
The more I investigate, the more conflicting information I get. I might be too concerned about preserving IO write operations, since my SQL Server database is in HA synchronous-commit mode.
Point #1:
This question is very much about IO write performance. In scenarios like SQL Server HA synchronous-commit mode, the cost of an IO write is high when the database servers reside in data centers on different subnets. Adding indexes adds to that expensive IO write cost.
Point #2:
Books Online suggests:
Indexing small tables may not be optimal because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan. Therefore, indexes on small tables might never be used, but must still be maintained as data in the table changes.
I am not sure adding an index to a table with only one row will ever have any benefit - or am I wrong?
Your understanding is wrong. Small tables also benefit from indexes, especially when they are used in joins with bigger tables.
The cost of an index has two parts: storage space and processing time during inserts/updates. The first is very cheap these days, so it can almost be disregarded. So your only real consideration should be tables with lots of updates and inserts; for those, apply the proper configuration.
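For example, even a tiny lookup table benefits from an index on its join key when it is joined to a large table all day long; a sketch with illustrative names (assuming SQL Server):

-- Small dimension table: ~100 rows, but joined constantly
CREATE TABLE dbo.Currency (
    CurrencyCode CHAR(3) NOT NULL CONSTRAINT PK_Currency PRIMARY KEY,
    CurrencyName NVARCHAR(50) NOT NULL
);

-- The clustered PK index gives the optimizer cheap seeks for the join,
-- instead of scanning Currency once per probe from the large Sales table
SELECT s.SaleID, c.CurrencyName
FROM dbo.Sales AS s
JOIN dbo.Currency AS c ON c.CurrencyCode = s.CurrencyCode;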

Database Implementation Help : Time-Series data

This is the re-submission of my previous question:
I have a collection of ordered time-series data(stock minute price information). My current database structure using PostgreSQL is below:
symbol_table - where I keep the list of the symbols with the symbol_id as a primary key(serial).
time_table, date_table - time/date values are stored there. time_id/date_id are primary keys(serial/serial).
My main minute_table contains the minute pricing information, where date_id|time_id|symbol_id form the composite primary key (they are also foreign keys to the corresponding tables).
Using this main minute_table I'm performing different statistical analyses and keep the results in separate tables, like one_minute_std - where one-minute standard deviation measures are kept.
Every night I'm updating the tables with the current price information from the last day's closing prices.
With the current implementation my tables contain all the symbols with around 50m records each.
Primary keys are indexed.
If I want to query for all the symbols where closing price > x and one_minute_std > 2 and one_minute_std < 4 for a specific date, the search takes about 3-4 minutes.
To speed up the process I was thinking of separating each symbol into its own table, but I'm not 100% sure if this is a 'proper' way of doing it.
Could you advise me on how I can speed up the query process?
It sounds like you want a combination of approaches.
First, you should look into table partitioning. This stores a single table across multiple storage units ("files"), but still gives you the flexibility of a single table. (Here is postgres documentation http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html).
You would want to partition either by day or by ticker symbol. My first reaction would be by time (day/week/month), since that is the unit of updates. However, if your analyses are only on a single ticker and often span multiple days, then there is an argument for using the ticker instead.
After partitioning, you may want to consider indexes. However, I suspect that partitioning will solve your performance problems.
Since your updates happen at night, you should fold your summarization process in with the updates. For instance, one_minute_std should be calculated during this process. You might find it best to load the nightly data into a temporary table, do the calculations for summaries such as one_minute_std, and then load the data into the final partitioned table scheme.
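As a rough sketch of the partition-by-time idea, using declarative partitioning (PostgreSQL 11 or later) and plain date/time columns instead of the lookup-table ids, purely for illustration:

-- Range-partition the minute data by date so a one-day query touches one partition
CREATE TABLE minute_table (
    symbol_id   integer NOT NULL,
    price_date  date    NOT NULL,
    price_time  time    NOT NULL,
    close_price numeric,
    PRIMARY KEY (symbol_id, price_date, price_time)
) PARTITION BY RANGE (price_date);

-- One child partition per month
CREATE TABLE minute_table_2013_01 PARTITION OF minute_table
    FOR VALUES FROM ('2013-01-01') TO ('2013-02-01');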
With so many rows that have so few columns, you are probably better off with a good partitioning scheme than an indexing scheme. In particular, indexes have a space overhead, and the smaller the record in each row, the more that using the index incurs an overhead comparable to scanning the entire table.

Is it possible to partition more than one way at a time in SQL Server?

I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions, then within each of these partitions to partition on a different partition column. Is this even possible in SQL Server, or am I limited to defining one partition column+function+scheme per table?
I'm interested in the more general answer, but this strategy is one I'm considering for a Distributed Partitioned View, where I'd partition the data under the first scheme using the DPV to distribute the huge amount of data over 8 machines, and then on each machine partition that portion of the full table on another partition key in order to be able to drop (for example) sub-partitions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2,0) persisted
I do it all the time, very simple.
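To make that concrete, a hedged sketch of a partition function and scheme bound to such a computed, persisted column (the names and boundary values are made up):

-- Hypothetical ranges for the computed partitioning key
CREATE PARTITION FUNCTION pfMyTable (int)
    AS RANGE LEFT FOR VALUES (1000, 2000, 3000, 4000, 5000, 6000, 7000);

CREATE PARTITION SCHEME psMyTable
    AS PARTITION pfMyTable ALL TO ([PRIMARY]);

-- The computed column must be PERSISTED to be used as the partitioning column
CREATE TABLE dbo.MyTable (
    Column1 int NOT NULL,
    Column2 int NOT NULL,
    PartitionID AS ISNULL(Column1 * Column2, 0) PERSISTED
) ON psMyTable (PartitionID);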
The DPV across a set of partitioned tables is your only clean option to achieve this, something like a DPV across tblSales2007, tblSales2008, tblSales2009, where each of the respective sales tables is partitioned again, but they could then be partitioned by a different key. There are some very good benefits in doing this in terms of operational resilience (one partitioned table going offline does not take the DPV down - it can still satisfy queries for the other time ranges).
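A simplified sketch of that layered approach, shown here as a local partitioned view for brevity (a true DPV would reference member tables on other servers via four-part names; the yearly tables and CHECK constraints are illustrative):

-- Each member table carries a CHECK constraint on the partitioning column
CREATE TABLE dbo.tblSales2009 (
    SaleYear int NOT NULL CHECK (SaleYear = 2009),
    SaleID   int NOT NULL,
    Amount   money NOT NULL,
    CONSTRAINT PK_tblSales2009 PRIMARY KEY (SaleYear, SaleID)
);
GO
-- The view unions the member tables; the optimizer uses the CHECK
-- constraints to touch only the relevant table(s) for a given query
CREATE VIEW dbo.vSales AS
    SELECT SaleYear, SaleID, Amount FROM dbo.tblSales2007
    UNION ALL
    SELECT SaleYear, SaleID, Amount FROM dbo.tblSales2008
    UNION ALL
    SELECT SaleYear, SaleID, Amount FROM dbo.tblSales2009;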
The hack option is to create an arbitrary hash of 2 columns, store it per record, and partition by it. You would have to generate this hash for every query / insertion etc., since the partition key cannot be computed on the fly; it must be a stored value. It's a hack, and I suspect it would lose more performance than you would gain.
You do have to think about specific management / DR issues with these data quantities, though. If the data volumes are very large and you access them in a primarily read-only manner, then you should look into SQL 'Madison', which will scale enormously in both the number of rows and the overall size of data. But it really only suits the 99.9%-read type of data warehouse; it is not suitable for OLTP.
I have production data sets sitting in the 'billions' bracket, and they reside on partitioned table systems and provide very good performance - although much of this is based on the hardware underlying the system, not the database itself. Scaling up to this level is not an issue, and I know of others who have gone well beyond those quantities as well.
The max number of partitions per table remains at 1,000; from what I remember of a conversation about this, it was a figure set based on the testing performed, not one in place due to a technical limitation.
