What is the best approach for clustering Snowflake tables?
Absolute clustering, by manually reloading the tables at a certain frequency in the order the data is retrieved
Create a cluster key and turn on auto-reclustering, but keep it suspended most of the time and run it only at certain intervals, perhaps guided by the partitions-scanned figure for the table
Thanks
Rajib
There is no general answer that applies across all data usage patterns, or that holds over time, because the clustering implementation is still evolving (said as an outsider, but one who has watched it change over time).
Auto-clustering is much like hard drive fragmentation management: both are the same idea of locating like data close together to make read performance better. And just as with disk defragmentation, different usage loads and patterns make the need for clustering/defragmentation more or less important, and some usages conflict with auto-clustering.
For example, we have some tables that are written to in as tight a loop as we can manage, and we want them clustered in a pattern that is 90% aligned with the insert order, so auto-clustering is not costly for the insert pattern. But once a month we delete from these tables for GDPR/PII reasons, and those updates/deletes change about 1/3 of the partitions. It would seem that a full table rewrite with an ORDER BY applied would be overkill, but because of the insert rate, auto-clustering (as it stands today) thrashes for hours and costs 5x as much as a full table rewrite.
We also have other tables (they contain address information) that are "rather small" and are full-table scanned a lot, so ordering them in the auto-clustering sense does not make sense. Instead we rebuild the table daily to keep the partition size as small as possible, so that full table scans are as fast as they can be. The point is that auto-clustering also does micro-partition optimization, which would be useful, but we don't need the table ordered, so we are not running clustering.
Your best method is to create the initial table sorted by your cluster key, and then turn on autoclustering...and then let Snowflake handle everything for you from there.
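As a rough sketch of that sequence, the Python helper below only assembles the Snowflake statements as strings; it does not connect to anything. The table name `events`, the key `event_date`, the `_sorted` suffix and the helper itself are hypothetical, and the suspend/resume pair is included only to show how reclustering could be paused between intervals, as the question proposes.

```python
from typing import List


def clustering_statements(table: str, cluster_key: str) -> List[str]:
    """Build (not run) the SQL for a sorted rebuild followed by auto-clustering."""
    rebuilt = f"{table}_sorted"  # hypothetical name for the rebuilt copy
    return [
        # 1. Create the initial table already sorted by the intended cluster key.
        f"CREATE OR REPLACE TABLE {rebuilt} AS "
        f"SELECT * FROM {table} ORDER BY {cluster_key}",
        # 2. Define the cluster key so automatic clustering can maintain the order.
        f"ALTER TABLE {rebuilt} CLUSTER BY ({cluster_key})",
        # 3. Optionally pause and later resume reclustering to control credit spend.
        f"ALTER TABLE {rebuilt} SUSPEND RECLUSTER",
        f"ALTER TABLE {rebuilt} RESUME RECLUSTER",
        # 4. Check how well clustered the table currently is.
        f"SELECT SYSTEM$CLUSTERING_INFORMATION('{rebuilt}', '({cluster_key})')",
    ]


for statement in clustering_statements("events", "event_date"):
    print(statement + ";")
```

Whether the suspend/resume step pays off depends on the insert and delete pattern, so it is worth comparing credit usage against a plain scheduled rebuild before settling on either.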
To cut to the chase:
Load the tables sorted by the date/time field that will be used to retrieve the data - the business date rather than the (ETL) insert date/time. This should be good enough for most tables from a data retrieval performance point of view.
You can choose to re-cluster depending on the rate of DML operations on the table.
If you have an additional access pattern on specific columns, you may consider adding clustering keys to the table and letting auto-clustering kick in.
It is always desirable to identify the access pattern sooner rather than later. Once you have, auto-clustering will re-arrange the data to keep retrieval performant.
Auto-clustering will cost you credits, but that cost should be outweighed by the performance you gain.
Link here will help you make an informed decision.
Hope this helps!
I have tables that have millions of partitions.
Should I reduce the partition count for performance?
In my experience with Spark applications and the Hive query system, too many partitions were bad for performance.
If you do not have auto clustering on the table, it will not be auto defragmented. So if you write to the table frequently with small row counts, it will be in very bad shape.
Partition count impacts compile time badly, as every partition has metadata that is loaded to plan/optimize the query. I would suggest doing a rebuild test (select into a new transient table) and running some comparable queries to see the difference in compile time.
We have a number of tables for which sorting (and thus auto-clustering) does not make sense, as the usage pattern is always a full-table scan. We just rebuild those tables on a schedule to keep the partition count down, and for us that rebuild cost is worth the performance gain.
As with everything Snowflake, you should run a test and see how it is for you. And monitor hot spots, as they can and do change.
In Snowflake, there are micro-partitions, and they are managed automatically. Therefore you do not need to worry about the number of micro-partitions.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#what-are-micro-partitions
It says:
Micro-partitioning is automatically performed on all Snowflake tables. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.
From this page, I understand that micro-partitions are managed by Snowflake, and you do not need to focus on reducing the partition count (this is the original question).
This should also help to understand the difference between clustering and micro-partitions:
https://docs.snowflake.com/en/user-guide/table-considerations.html#when-to-set-a-clustering-key
If you read the above link, you can see that it is not a must to define clustering, even on large tables, to get good query performance!
As for the original question about reducing the partition count, I should also say that clustering does not always reduce the number of partitions, but that is another story.
We are designing a table for ad-hoc analysis that will capture umpteen value fields over time for claims received. The table structure is essentially (pseudo-ish-code):
table_huge (
    claim_key int not null,
    valuation_date_key int not null,
    value_1 some_number_type,
    value_2 some_number_type,
    [etc...],
    constraint pk_huge primary key (claim_key, valuation_date_key)
);
All value fields are numeric. The requirements are: The table shall capture a minimum of 12 recent years (hopefully more) of incepted claims. Each claim shall have a valuation date for each month-end occurring between claim inception and the current date. Typical claim inception volumes range from 50k-100k per year.
Adding all this up, I project a table with a row count on the order of 100 million, which could grow to as much as 500 million over the years depending on the business's needs. The table will be rebuilt each month. Consumers will select only. Other than a monthly refresh, no updates, inserts or deletes will occur.
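The projection can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes claims are incepted evenly through each year and that every claim keeps receiving one valuation row per month-end until today; both are simplifying assumptions, and the function name is made up for illustration.

```python
def projected_rows(claims_per_year: int, years: int) -> int:
    """Estimate total rows when each claim gets one row per month-end since inception."""
    months = years * 12
    # Claims incepted m months ago have accumulated about m month-end rows.
    # Summing m = 1..months gives months * (months + 1) / 2 rows per monthly cohort,
    # and each monthly cohort holds claims_per_year / 12 claims.
    return claims_per_year * months * (months + 1) // 24


for volume in (50_000, 100_000):
    rows = projected_rows(volume, 12)
    print(f"{volume:,} claims/year over 12 years -> about {rows:,} rows")
```

Under these assumptions the 12-year window lands between roughly 43 and 87 million rows, which agrees with the "order of 100 million" estimate; the row count grows with the square of the number of years retained, which is why extending the history is what pushes it towards 500 million.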
I am coming at this from the business (consumer) side, but I have an interest in mitigating the IT cost while preserving the analytical value of this table. We are not overwhelmingly concerned about quick returns from the Table, but will occasionally need to throw a couple dozen queries at it and get all results in a day or three.
For argument's sake, let's assume the technology stack is, I dunno, in the 80th percentile of modern hardware.
The questions I have are:
Is there a point at which the cost-to-benefit of indices becomes excessive, considering a low frequency of queries against high-volume tables?
Does the SO community have experience with 100M+ row tables and can offer tips on how to manage?
Do I leave the database technology problem to IT to solve or should I seriously consider curbing the business requirements (and why?)?
I know these are somewhat soft questions, and I hope readers appreciate this is not a proposition I can test before building.
Please let me know if any clarifications are needed. Thanks for reading!
First of all: Expect this to "just work" if leaving the tech problem to IT - especially if your budget allows for an "80% current" hardware level.
I do have experience with 200M+ rows in MySQL on entry-level and outdated hardware, and I was always positively surprised.
Some Hints:
On monthly refresh, load the table without non-primary indices, then create them. Search for the sweet spot of how many index creations in parallel work best. In a project with much less data (ca. 10M) this reduced load time by 70% compared to the naive "create table, then load data" approach.
Try to get a grip on the number and complexity of concurrent queries: This has influence on your hardware decisions (less concurrency=less IO, more CPU)
Assuming you have 20 numeric fields of 64 bits each, times 200M rows: if I calculate correctly, this is a payload of 32GB. Trade cheap disks against 64G RAM and never ever have an IO bottleneck.
Make sure you set the tablespace to read-only.
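The 32GB figure in the hints above is easy to reproduce. This is only raw payload arithmetic under the stated assumptions (20 fields, 64 bits each, 200M rows); it ignores row headers, keys, indexes and any compression the database applies, and the function name is made up for illustration.

```python
def raw_payload_bytes(rows: int, fields: int, bits_per_field: int = 64) -> int:
    """Raw size of the numeric payload, excluding any storage overhead."""
    return rows * fields * (bits_per_field // 8)


payload = raw_payload_bytes(rows=200_000_000, fields=20)
print(f"{payload / 10**9:.0f} GB raw payload")
print("fits in 64 GB of RAM:", payload < 64 * 10**9)
```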
You could consider an anchor modeling approach to store changes only.
Considering that there are so many expected repeated rows, ~95% -- bringing the row count from 100M to only 5M -- this removes most of your concerns. At this point it is mostly a cache consideration: if the whole table can somehow fit into cache, things happen fairly fast.
For "low" data volumes, the following structure is slower to query than a plain table; at one point (as data volume grows) it becomes faster. That point depends on several factors, but it may be easy to test. Take a look at this white-paper about anchor modeling -- see graphs on page 10.
In terms of anchor-modeling, it is equivalent to
The modeling tool has automatic code generation, but it seems that it currently fully supports only MS SQL Server, though there is Oracle in the drop-down too. It can still be used as a code-helper.
In terms of supporting code, you will need (minimum)
Latest perspective view (auto-generated)
Point in time function (auto-generated)
Staging table from which this structure will be loaded (see tutorial for data-warehouse-loading)
Loading function, from staging table to the structure
Pruning functions for each attribute, to remove any repeating values
It is easy to create all this by following auto-generated-code patterns.
With no ongoing updates/inserts, an index NEVER has negative performance consequences, only positive (by MANY orders of magnitude for tables of this size).
More critically, the schema is seriously flawed. What you want is
Claim
    claim_key
    valuation_date
ClaimValue
    claim_key (fk->Claim.claim_key)
    value_key
    value
This is much more space-efficient as it stores only the values you actually have, and does not require schema changes when the number of values for a single row exceeds the number of columns you have allocated.
Using partitioning, and applying the partition key in every query you run, will give you further performance improvements.
In our company we solved a huge number of performance issues with partitioning.
One more design suggestion: if you know that the table is going to be very, very big, try not to put many constraints on the table and instead handle them in your logic before you write, and don't have many columns on the table, to avoid row chaining issues.
I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions, then within each of these partitions to partition on a different partition column. Is this even possible in SQL Server, or am I limited to defining one partition column+function+scheme per table?
I'm interested in the more general answer, but this strategy is one I'm considering for a Distributed Partitioned View (DPV), where I'd partition the data under the first scheme using the DPV to distribute the huge amount of data over 8 machines, and then on each machine partition that portion of the full table on another partition key in order to be able to drop (for example) sub-partitions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2,0) persisted
I do it all the time, very simple.
The DPV across a set of Partitioned Tables is your only clean option to achieve this, something like a DPV across tblSales2007, tblSales2008, tblSales2009, where each of the respective sales tables is partitioned again, but could then be partitioned by a different key. There are some very good benefits in doing this in terms of operational resilience (one partitioned table going offline does not take the DPV down - it can still satisfy queries for the other timelines).
The hack option is to create an arbitrary hash of 2 columns, store this per record, and partition by it. You would have to generate this hash for every query, insertion, etc., since the partition key cannot be computed; it must be a stored value. It's a hack and I suspect it would lose more performance than you would gain.
You do have to be thinking of specific management issues / DR over data quantities though. If the data volumes are very large and your access is primarily read-only, then you should look into SQL 'Madison', which will scale enormously in both number of rows and overall size of data. But it really only suits the 99.9% read type of data warehouse; it is not suitable for OLTP.
I have production data sets sitting in the 'billions' bracket, and they reside on partitioned table systems and provide very good performance - although much of this is down to the hardware underlying the system, not the database itself. Scaling up to this level is not an issue, and I know of others who have gone well beyond those quantities as well.
The max partitions per table remains at 1000, from what I remember of a conversation about this, it was a figure set by the testing performed - not a figure in place due to a technical limitation.
I have an app, which cycles through a huge number of records in a database table and performs a number of SQL and .Net operations on records within that database (currently I am using Castle.ActiveRecord on PostgreSQL).
I added some basic btree indexes on a couple of the fields and, as you would expect, the performance of the SQL operations increased substantially. Wanting to make the most of DBMS performance, I want to make better-educated choices about what I should index in all my projects.
I understand that there is a detriment to performance when doing inserts (as the database needs to update the index as well as the data), but what suggestions and best practices should I consider when creating database indexes? How do I best select the fields/combinations of fields for a set of database indexes (rules of thumb)?
Also, how do I best select which index to use as a clustered index? And when it comes to the access method, under what conditions should I use a btree over a hash or a gist or a gin (what are they anyway?).
Some of my rules of thumb:
Index ALL primary keys (I think most RDBMS do this when the table is created).
Index ALL foreign key columns.
Create more indexes ONLY if:
Queries are slow.
You know the data volume is going to increase significantly.
Run statistics when populating a lot of data in tables.
If a query is slow, look at the execution plan and:
If the query for a table only uses a few columns, put all those columns into an index, then you can help the RDBMS to only use the index.
Don't waste resources indexing tiny tables (hundreds of records).
Index multiple columns in order from high cardinality to less. This means: first index the columns with more distinct values, followed by columns with fewer distinct values.
If a query needs to access more than 10% of the data, a full scan is normally better than an index.
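The rule of thumb about putting all of a query's columns into one index can be tried out on a small scale. The sketch below uses SQLite through Python's `sqlite3` module purely as a convenient stand-in (the question is about PostgreSQL, whose planner and plan output differ); the `orders` table, its columns and the index name are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "status TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, status, notes) VALUES (?, ?, ?)",
    [(i % 100, "open" if i % 3 else "closed", "x" * 50) for i in range(10_000)],
)
# The index contains every column the first query below needs.
conn.execute("CREATE INDEX idx_orders_customer_status ON orders (customer_id, status)")


def plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Answerable from the index alone: no lookup into the table is needed.
covered_plan = plan(
    "SELECT customer_id, status FROM orders WHERE customer_id = ?", (42,)
)
# 'notes' is not in the index, so each match also needs a table lookup.
uncovered_plan = plan(
    "SELECT customer_id, status, notes FROM orders WHERE customer_id = ?", (42,)
)
print(covered_plan)
print(uncovered_plan)
```

In SQLite the first plan reports a covering index while the second uses the same index without covering the query; other databases expose the same distinction under different names, such as an index-only scan.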
Here's a slightly simplistic overview: it's certainly true that there is an overhead to data modifications due to the presence of indexes, but you ought to consider the relative number of reads and writes to the data. In general the number of reads is far higher than the number of writes, and you should take that into account when defining an indexing strategy.
When it comes to which columns to index, I've always felt that the designer ought to know the business well enough to be able to take a very good first pass at which columns are likely to benefit. Other than that, it really comes down to feedback from the programmers, full-scale testing, and system monitoring (preferably with extensive internal metrics on performance to capture long-running operations).
As @David Aldridge mentioned, the majority of databases perform many more reads than writes. In addition, appropriate indexes will often be utilised even when performing INSERTs (to determine the correct place to INSERT).
The critical indexes under an unknown production workload are often hard to guess/estimate, and a set of indexes should not be viewed as set once and forget. Indexes should be monitored and altered with changing workloads (that new killer report, for instance).
Nothing beats profiling; if you guess your indexes, you will often miss the really important ones.
As a general rule, if I have little idea how the database will be queried, then I will create indexes on all foreign keys, profile under a workload (think UAT release), remove the indexes that are not being used, and create any important missing ones.
Also, make sure that a scheduled index maintenance plan is created.
I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there are lots of SELECTs and very few changes, index all you like... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs and DELETEs, these will be very slow with lots of indexes, since all the indexes need to be modified each time one of these operations takes place.
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
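One way to put a number on "how unique" a column is: divide the count of distinct values by the row count. The sketch below is a simple illustration with made-up columns, not a rule any particular optimizer applies.

```python
def selectivity(values):
    """Fraction of distinct values: close to 1.0 is highly selective, close to 0.0 is not."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


flag_column = [i % 2 for i in range(10_000)]  # only 2 distinct values
id_column = list(range(10_000))               # every value is unique

print(f"flag column selectivity: {selectivity(flag_column):.4f}")
print(f"id column selectivity:   {selectivity(id_column):.4f}")
```

A B-Tree index on the flag column would still have to return about half the table for either value, which is why it adds so little, whereas a lookup on the id column narrows the search to a single row.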
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with any optimization, I stop when the requested performance is reached (this obviously implies that point 0. would be getting specific performance requirements).
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't get used. One strategy that lets you find indexes that aren't used is to monitor index usage. You do this as follows:
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes on if the queries against the table would benefit.
In a paraphrase of Einstein about simplicity, add as many indexes as you need and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
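To make the bind-variable pattern concrete, here is a small sketch using Python's `sqlite3` module as a stand-in. It shows only the shape of the technique (one statement text, many parameter values); it says nothing about how Oracle's SQL cache or soft parsing behave, and the `customers` table is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany(
    "INSERT INTO customers (city) VALUES (?)",
    [("Oslo",), ("Lima",), ("Oslo",)],
)

# The statement text never changes; only the bound value does. A database that
# caches plans by statement text can therefore reuse the plan between calls.
QUERY = "SELECT COUNT(*) FROM customers WHERE city = ?"

counts = {
    city: conn.execute(QUERY, (city,)).fetchone()[0]
    for city in ("Oslo", "Lima", "Rome")
}
print(counts)
```

Building the statement by concatenating the city into the SQL text would instead produce three different statements, each needing its own plan.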
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and a real MySQL database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real project and real MySQL database.
My results are: adding an average index (1-3 columns in an index) to a table makes inserts slower by 2.1%. So, if you add 20 indexes, your inserts will be slower by 40-50%. But your selects will be 10-100 times faster.
So is it ok to add many indexes? - It depends :) I gave you my results - You decide!
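The jump from 2.1% per index to 40-50% for 20 indexes can be checked with a little arithmetic. The sketch below extrapolates the single measured figure in two ways, additively and compounded; both are assumptions about how the overhead scales, since the quoted test does not say which applies.

```python
def insert_slowdown(per_index: float, n_indexes: int):
    """Extrapolate a per-index insert penalty to n indexes, two ways."""
    additive = per_index * n_indexes                # each index adds a fixed share
    compounded = (1 + per_index) ** n_indexes - 1   # each index multiplies the cost
    return additive, compounded


additive, compounded = insert_slowdown(0.021, 20)
print(f"additive:   {additive:.1%}")
print(f"compounded: {compounded:.1%}")
```

The two estimates come out at roughly 42% and 52%, which brackets the quoted range.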
Ultimately, how many indexes you need depends on the behavior of the applications that ride on top of your database server.
In general, the more inserting you do, the more painful your indexes become. Each time you do an insert, all the indexes on that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the opposite: queries are done over unrestricted combinations of fields, and any one in particular could be used at any given time.
Beyond just indexing, there's reorganizing your DB to include calculated search fields, splitting tables, etc. - it's really dependent on your load shapes and query parameters, and on how much/what data 'really' needs to be returned by a query.
If your entire DB is fronted by stored-procedure facades, tuning becomes a bit easier, as you don't have to worry about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning Advisor useful - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'official' or third party.
This really is more of a theoretical question than a practical one. The impact of indexes on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard Oracle announced dedicated storage, made by HP, which is supposed to perform 10 times faster with an 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that applies to your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to IOT (index organized table), if your data support this.
Of course there might be many more solutions for such a case. My first suggestion to you would be to clone the DB to a development environment and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to speed up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
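The leftmost-prefix behaviour described here can be observed directly. The sketch below uses SQLite via Python's `sqlite3` module as a stand-in (the question concerns Oracle, whose optimizer has additional options); the `items` table and index name are invented, and the last query is added only to show a search that the index does not help in this setup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (column1 INTEGER, column2 INTEGER, column3 INTEGER, payload TEXT)"
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?, ?)",
    [(i % 50, i % 7, i % 3, "p") for i in range(5_000)],
)
# One composite index, ordered by how commonly each column is searched.
conn.execute("CREATE INDEX idx_c1_c2_c3 ON items (column1, column2, column3)")


def plan(where):
    """Return the detail column of EXPLAIN QUERY PLAN for a given WHERE clause."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM items WHERE " + where
    ).fetchall()
    return " | ".join(row[3] for row in rows)


plans = {
    "column1": plan("column1 = 1"),
    "column1 + column2": plan("column1 = 1 AND column2 = 2"),
    "all three": plan("column1 = 1 AND column2 = 2 AND column3 = 0"),
    # Skips the leading column, so the index order gives no shortcut here.
    "column3 only": plan("column3 = 0"),
}
for label, detail in plans.items():
    print(f"{label:>18}: {detail}")
```

The first three searches all go through the single index, while the search on `column3` alone falls back to scanning the table, which is the trade-off to weigh when choosing the column order.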
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the number of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criterion will cost you more to maintain than the performance penalty of not having it in the odd case where it would have been used.
SQL Server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
It depends entirely on the columns that are used in the WHERE clause.
As a rule of thumb, we must have indexes on foreign key columns to avoid deadlocks.
AWR reports should be analyzed periodically to understand which indexes are needed.