Search optimization vs cluster key in Snowflake

Can someone explain when we should use search optimization versus a cluster key for a table, or whether we should use both?
I see that we are consuming credits if we enable both of them?
Thanks,
Sye

Search Optimization is used when you need to access a small number of rows (point-lookup queries), like when you access an OLTP database.
A Cluster Key is for partitioning your data. It's generally good for any kind of workload unless you need to read the whole table.
If you don't need to access specific rows in your large table, you don't need the Search Optimization service.
If your table is not large, or if you ingest "ordered" data into your table, you don't need auto-clustering (cluster keys).
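For reference, a minimal sketch of how each feature is enabled (table and column names are hypothetical); both incur serverless credits for the initial build and for ongoing maintenance, which is why enabling both on the same table costs more:

ALTER TABLE sales ADD SEARCH OPTIMIZATION;          -- helps point lookups on a large table
ALTER TABLE sales CLUSTER BY (transaction_date);    -- helps range filters on a large table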

When you load a table into Snowflake, it creates 'micro-partitions' based on the order of the rows at load time. When a SQL statement is run, the WHERE clause is used to prune the search space of which partitions need to be scanned.
A Cluster Key in Snowflake simply reorders the data by the cluster key, so that it is co-located within the same micro-partitions. This can result in massive performance improvements if your queries frequently use the cluster key in the WHERE clause to filter the results.
Search optimization is for finding 1 or a small number of records based on using '=' in the where clause.
So if you have a table with Product_ID, Transaction_Date, Amount.
Queries using 'Where Year(Transaction_Date) >= 2017' would benefit from a cluster key on Transaction_Date.
Queries using 'Where Product_ID = 111222333' would benefit from search optimization.
In either case, these are only needed if your table is large (think billions of rows). Otherwise, the native Snowflake micro-partition approach will do a good job at optimization.
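To make that concrete, a hedged sketch using the hypothetical table from the answer above; the date filter is written as a plain range predicate on the clustering column, which gives the optimizer the best chance to prune, while the equality lookup is what search optimization accelerates:

CREATE TABLE transactions (Product_ID NUMBER, Transaction_Date DATE, Amount NUMBER(12,2));

ALTER TABLE transactions CLUSTER BY (Transaction_Date);     -- co-locates rows by date
SELECT SUM(Amount) FROM transactions
WHERE Transaction_Date >= '2017-01-01';                     -- prunes micro-partitions via the cluster key

ALTER TABLE transactions ADD SEARCH OPTIMIZATION;           -- builds a search access path
SELECT * FROM transactions
WHERE Product_ID = 111222333;                               -- point lookup served by search optimization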

Please don't call Cluster Key "partitioning". Although the effect is similar, they are two distinct operations with different meanings. I will be publishing an article on partitioning and pruning shortly.

Related

Large partition count performance impact

I have tables that have millions of partitions.
Should I reduce the partition count for performance?
In my experience with Spark applications and Hive query systems, too many partitions were bad for performance.
If you do not have auto-clustering on the table, it will not be automatically defragmented. So if you write to the table frequently with small row counts, it will be in very bad shape.
Partition count impacts compile time badly, as every partition has metadata that is loaded to plan/optimize the query. I would suggest doing a rebuild test (select into a new transient table) and running some comparable queries to see the difference in compile time.
We have a number of tables for which sorting (and thus auto-clustering) does not make sense, as the usage pattern is always a full-table scan, so we just rebuild those tables on a schedule to keep the partition count down; for us, that rebuild cost is worth the performance gain.
As with everything Snowflake you should run a test, and see how it is for you. And monitor hot spots as they can and do change.
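As a sketch of the rebuild test mentioned above (table name hypothetical), selecting into a new transient table compacts many small micro-partitions into fewer, larger ones:

CREATE OR REPLACE TRANSIENT TABLE big_table_rebuilt AS
SELECT * FROM big_table;     -- optionally add ORDER BY if a sort order helps your queries

-- Run comparable queries against both tables and compare compile time and partition counts.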
In Snowflake, there are micro-partitions, and they are managed automatically. Therefore you do not need to worry about the number of micro-partitions.
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#what-are-micro-partitions
It says:
Micro-partitioning is automatically performed on all Snowflake tables.
Tables are transparently partitioned using the ordering of the data as
it is inserted/loaded.
From this page, I understand that micro-partitions are managed by Snowflake, and you do not need to focus on reducing the partition count (this is the original question).
This should also help to understand the difference between clustering and micro-partitions:
https://docs.snowflake.com/en/user-guide/table-considerations.html#when-to-set-a-clustering-key
If you read the above link, you can see that defining a clustering key is not a must, even on large tables, to get good query performance!
As for the original question about reducing the partition count, I also have to say that clustering does not always reduce the number of partitions, but that is another story.
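If you want to see how well a table's micro-partitions already line up with a candidate key before paying for clustering, Snowflake exposes a clustering-information function (table and column names here are hypothetical):

SELECT SYSTEM$CLUSTERING_INFORMATION('my_large_table', '(transaction_date)');
-- Returns JSON including total_partition_count, average_depth and average_overlaps;
-- low depth/overlap means pruning already works well without an explicit cluster key.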

Does Snowflake support indexes?

In the Snowflake documentation, I could not find a reference to using Indexes.
Does Snowflake support Indexes and, if not, what is the alternative approach to performance tuning when using Snowflake?
Snowflake does not use indexes. This is one of the things that makes Snowflake scale so well for arbitrary queries. Instead, Snowflake calculates statistics about columns and records in files that you load, and uses those statistics to figure out what parts of what tables/records to actually load to execute a query. It also uses a columnar store file format, that lets it only read the parts of the table that contain the fields (columns) you actually use, and thus cut down on I/O on columns that you don't use in the query.
Snowflake slices big tables (gigabyte, terabyte or larger) into smaller "micro-partitions." For each micro-partition, it collects statistics about what value ranges each column contains. Then, it only loads micro-partitions that contain values in the range needed by your query. As an example, let's say you have a column of time stamps. If your query asks for data between June 1 and July 1, then partitions that do not contain any data in this range will not be loaded or processed, based on the statistics stored for dates in the micro-partition files.
Indexes are often used for online transaction processing, because they accelerate workflows when you work with one or a few records, but when you run analytics queries on large datasets, you almost always work with large subsets of each table in your joins and aggregates. The storage mechanism, with automatic statistics, automatically accelerates such large queries, with no need for you to specify an index, or tune any kind of parameters.
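A small illustration of that pruning, with hypothetical names; the range predicate lets Snowflake skip any micro-partition whose min/max metadata for the timestamp column falls outside the range:

SELECT COUNT(*)
FROM events
WHERE event_ts >= '2023-06-01'
  AND event_ts <  '2023-07-01';   -- only micro-partitions overlapping June are scanned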
Snowflake does not support indexes, though it does support "clustering" for performance improvements of I/O.
I recommend reading these links to get familiar with this:
https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html
https://docs.snowflake.net/manuals/user-guide/tables-auto-reclustering.html
Here's a really good blog post on the topic as well:
https://www.snowflake.com/blog/automatic-query-optimization-no-tuning/
Hope this helps...Rich
No, Snowflake does not have indexes. Its performance boosts come from eliminating unnecessary scanning, which it achieves by maintaining rich metadata in each of its micro-partitions. For instance, if you have a time filter in your query and your table is more or less sorted by time, then Snowflake can "prune" away the parts of the table that are not relevant to the query.
Having said this, Snowflake is constantly releasing new features, and one such feature is its Search Optimisation Service, which allows you to perform "needle in a haystack" queries on selected columns that you enable. Not quite indexes that you can create, but something like that being used behind the scenes perhaps.
No, Snowflake doesn't support indexes. And don't let them tell you that this is an advantage.
Performance tuning can be done as described above, but it is often done with money: pay for bigger warehouses.
Snowflake doesn't support indexes; it keeps data in micro-partitions, or in other words it breaks data sets into small files, converts rows to a columnar format, and compresses them. Snowflake's metadata manager in the services layer has all the information about each micro-partition, such as which partition holds which data.
Each partition has information about itself in its header, such as max value, min value, cardinality, etc. This works much better than the indexes of conventional databases.
Snowflake is a columnar database with automatic micro-partitioning. Note that in SQL Server, Microsoft calls their columnar storage option a columnstore index.
The performance gain from columnar storage on data warehouse/mart type queries is spectacular compared with their row-store brethren. By storing data by column, the columns can be greatly compressed, allowing a huge amount of data to be held in memory.
If your predominant queries are on a naturally ordered column, such as OrderDate then it makes sense to cluster on OrderDate. You will gain a performance benefit from doing that.
Clustering isn't a catch-all performance boost. Choose your clustering unwisely and you can degrade performance for your queries.
In terms of performance tuning there are techniques you can use.
When using a dimensional model, look at the most commonly used attributes of those dimensions and look to denormalise them into your fact tables to reduce the number of joins.
For example, if the queries use Week, Month and Quarter, then denormalise those attributes into the fact table that is giving you performance concerns (a sketch follows below). The effect on storage in a column-store DB is far less than in a row-store DB, so the cost/benefit balance is much better.
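As a sketch of that denormalisation (all names hypothetical), the date attributes are copied onto the fact table at load time so the common filters no longer need a join to the date dimension:

CREATE OR REPLACE TABLE fact_sales_denorm AS
SELECT f.*,
       d.week_of_year,
       d.month_name,
       d.quarter
FROM   fact_sales f
JOIN   dim_date   d ON d.date_key = f.order_date_key;

-- Queries filtering or grouping by week, month or quarter now touch one table instead of two.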
Materialised views are another way of performance tuning; however, these come with caveats:
The range of SQL statements available to you for materialised views is far less than for other views
Not all aggregates are supported
Can only be on a single table
They work well when data doesn't change often.
If your underlying table is clustered on OrderDate, then a materialised view of last month's orders might not give you the desired performance benefit, because partition pruning might already be doing what is needed.
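A minimal Snowflake materialised-view sketch that stays within those constraints (single table, simple aggregates; names are hypothetical, and the feature requires Enterprise Edition):

CREATE MATERIALIZED VIEW mv_orders_by_month AS
SELECT DATE_TRUNC('month', OrderDate) AS order_month,
       COUNT(*)                       AS order_count,
       SUM(Amount)                    AS total_amount
FROM   orders
GROUP BY DATE_TRUNC('month', OrderDate);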
If your query performance is as a result of contention with other users then spinning up another warehouse might be the answer. 2 warehouses dedicated to their tasks might be more cost effective than scaling up a single warehouse.
Primary/unique key constraints can be defined but are metadata only despite the constraint documentation describing the enforced/not enforced syntax.
Some distributed column stores do support PK and FK constraints, Vertica being an example, but most do not because the performance impact of enforcing them is too high.
** Updated Fall 2022 - thanks to Hobo's comment: Yes, via Unistore's Hybrid Tables. **
Original Response:
Neither Snowflake nor any high-performance big data / OLAP system will support [unique] indexes, because these systems are MPP (Massively Parallel Processing). MPP systems load data with thousands of concurrent inserts into the same table. [Unique] indexes are a concept from much smaller / OLTP systems. Even then, many data engineers intentionally disable [unique] indexes on OLTP systems when they approach big data scale, especially when the data is frequently inserted, updated and deleted.
If you want a "non-unique index" then you can use a slew of features such as: micro-partitions, clustered tables, auto-clustering, Search Optimization Service, etc.
This Medium article can give you some workarounds: How can we enforce [Unique, Primary Key, Foreign Key (UPF)] column constraints in Snowflake?
Snowflake does not support indexing natively, but it has other ways to tune performance:
Reduce queuing by setting a time-out and/or adjusting the max concurrency
Use result caching
Tackle disk spilling
Rectify row expansion by using the distinct clause, using temporary tables and checking your join order
Fix inadequate pruning by setting up data clustering
Reference: https://rockset.com/blog/what-do-i-do-when-my-snowflake-query-is-slow-part-2-solutions/ (Disclosure: I work for Rockset).
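A few of those knobs can be sketched as follows (warehouse name and parameter values are illustrative, not recommendations):

ALTER WAREHOUSE my_wh SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300;  -- time out statements stuck in the queue
ALTER WAREHOUSE my_wh SET MAX_CONCURRENCY_LEVEL = 12;                 -- adjust how many statements run concurrently
ALTER SESSION SET USE_CACHED_RESULT = TRUE;                           -- allow reuse of cached query results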
In short, Snowflake does not support indexes, but it does support a single clustering key on each table.
Snowflake does not support indexes, but if you are looking for optimization you can use Snowflake's Search Optimization service.
Please refer to the Snowflake documentation below.
https://docs.snowflake.com/en/user-guide/search-optimization-service.html
Snowflake's Search Optimization Service will create indexes over all the pertinent columns in a table "out of the box", as well as provide other advanced search features (e.g. substring and regex matching).
If you'd like to optimize for specific expressions used in your queries, you can customize SOS as well.
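For example, search optimization can be scoped to particular columns and access patterns rather than the whole table (column names are hypothetical):

ALTER TABLE events ADD SEARCH OPTIMIZATION ON EQUALITY(user_id), SUBSTRING(error_message);
-- EQUALITY accelerates point lookups; SUBSTRING accelerates LIKE '%...%' style predicates.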

SQL Server 2008 indexes - performance gain on queries vs. loss on INSERT/UPDATE

How can you determine if the performance gained on a SELECT by indexing a column will outweigh the performance loss on an INSERT in the same table? Is there a "tipping-point" in the size of the table when the index does more harm than good?
I have table in SQL Server 2008 with 2-3 million rows at any given time. Every time an insert is done on the table, a lookup is also done on the same table using two of its columns. I'm trying to determine if it would be beneficial to add indexes to the two columns used in the lookup.
Like everything else SQL-related, it depends:
What kind of fields are they? Varchar? Int? Datetime?
Are there other indexes on the table?
Will you need to include additional fields?
What's the clustered index?
How many rows are inserted/deleted in a transaction?
The only real way to know is to benchmark it. Put the index(es) in place and do frequent monitoring, or run a trace.
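A minimal benchmarking sketch in T-SQL, assuming the two lookup columns are ColA and ColB (all names hypothetical):

CREATE NONCLUSTERED INDEX IX_MyTable_ColA_ColB
    ON dbo.MyTable (ColA, ColB);

-- Compare logical reads and elapsed time for the lookup and the insert,
-- with and without the index in place.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT Id FROM dbo.MyTable WHERE ColA = 42 AND ColB = 'X';
INSERT INTO dbo.MyTable (ColA, ColB) VALUES (43, 'Y');
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;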
This depends on your workload and your requirements. Sometimes data is loaded once and read millions of times, but sometimes not all loaded data is ever read.
Sometimes reads or writes must complete in certain time.
Case 1: If the table is static and queried heavily (e.g. the item table in a shopping cart application), then indexes on the appropriate fields are highly beneficial.
Case 2: If the table is highly dynamic and not a lot of querying is done on a daily basis (e.g. log tables used for auditing purposes), then indexes will slow down the writes.
If the above two cases are the boundary cases, then whether or not to build indexes on a table depends on which case the table in question comes closest to.
If not, leave it to the judgement of the query tuning advisor. Good luck.

How to deal with billions of records in an sql server?

I have a SQL Server 2008 database with 30,000,000,000 records in one of its major tables. Now we are looking at the performance of our queries. We are done with all the indexes. I found that we can split our database tables into multiple partitions, so that the data will be spread over multiple files and query performance will increase.
But unfortunately this functionality is only available in the SQL Server Enterprise edition, which is unaffordable for us.
Is there any way to optimize the query performance? For example, the query
select * from mymajortable where date between '2000/10/10' and '2010/10/10'
takes around 15 min to retrieve around 10000 records.
A SELECT * will obviously be less efficiently served than a query that uses a covering index.
First step: examine the query plan and look for any table scans and the steps taking the most effort (%).
If you don’t already have an index on your ‘date’ column, you certainly need one (assuming sufficient selectivity). Try to reduce the columns in the select list, and if ‘sufficiently’ few, add these to the index as included columns (this can eliminate bookmark lookups into the clustered index and boost performance).
You could break your data up into separate tables (say by a date range) and combine via a view.
It is also very dependent on your hardware (# cores, RAM, I/O subsystem speed, network bandwidth)
Suggest you post your table and index definitions.
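A covering-index sketch for the query above; the included columns here are hypothetical and should match a deliberately narrowed SELECT list:

CREATE NONCLUSTERED INDEX IX_mymajortable_date
    ON dbo.mymajortable ([date])
    INCLUDE (CustomerId, Amount);

SELECT [date], CustomerId, Amount
FROM   dbo.mymajortable
WHERE  [date] BETWEEN '2000-10-10' AND '2010-10-10';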
First, always avoid SELECT *, as that causes the select to fetch all columns; if there is an index with just the columns you need, you are otherwise fetching a lot of unnecessary data. Using only the exact columns you need to retrieve lets the server make better use of indexes.
Secondly, have a look at included columns for your indexes; that way, often-requested data can be included in the index to avoid having to fetch rows.
Third, you might try to use an int column for the date and convert the date into an int. Ints are usually more effective in range searches than dates, especially if you have time information too, and if you can skip the time information the index will be smaller.
One more thing to check is the execution plan the server uses; you can see this in Management Studio if you enable "show execution plan" in the menu. It can indicate where the problem lies; you can see which indexes it tries to use, and sometimes it will suggest new indexes to add.
It can also indicate other problems: a Table Scan or Index Scan is bad, as it indicates that the server has to scan through the whole table or index, while an Index Seek is good.
It is a good source to understand how the server works.
If you add an index on date, you will probably speed up your query due to an index seek + key lookup instead of a clustered index scan, but if your filter on date returns too many records the index will not help you at all, because the key lookup is executed for each result of the index seek. SQL Server will then switch to a clustered index scan.
To get the best performance you need to create a covering index, that is, include all the columns you need in the "included columns" part of your index, but that will not help you if you use SELECT *.
Another issue with the SELECT * approach is that you can't use the cache or the execution plans in an efficient way. If you really need all columns, make sure you specify all the columns instead of the *.
You should also fully qualify the object name to make sure your plan is reusable.
You might consider creating an archive database and moving anything older than, say, 10-20 years into the archive database. This should drastically speed up your primary production database while retaining all of your historical data for reporting needs.
What type of queries are we talking about?
Is this a production table? If yes, look into normalizing a bit more and see whether you cannot go a bit further with normalizing the DB.
If this is for reports, including a lot of Ad Hoc report queries, this screams data warehouse.
I would create a DW with separate pre-processed reports which include all the calculations and aggregations you could expect.
I am a bit worried about a business model which involves dealing with BIG data but does not generate enough revenue or even attract enough venture investment to upgrade to enterprise.

Is it possible to partition more than one way at a time in SQL Server?

I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions, then within each of these partitions to partition on a different partition column. Is this even possible in SQL Server, or am I limited to defining one partition column+function+scheme per table?
I'm interested in the more general answer, but this strategy is one I'm considering for a Distributed Partitioned View, where I'd partition the data under the first scheme using a DPV to distribute the huge amount of data over 8 machines, and then on each machine partition that portion of the full table on another partition key in order to be able to drop (for example) sub-partitions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2, 0) PERSISTED;
I do it all the time, very simple.
The DPV across a set of partitioned tables is your only clean option to achieve this: something like a DPV across tblSales2007, tblSales2008, tblSales2009, where each of the respective sales tables is partitioned again, but could then be partitioned by a different key (a rough sketch follows below). There are some very good benefits in doing this in terms of operational resilience (one partitioned table going offline does not take the DPV down; it can still satisfy queries for the other timelines).
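A rough (local) sketch of that shape, with all names hypothetical; for a true distributed partitioned view the member tables live on linked servers, and the CHECK constraint on each member table is what lets the optimizer route a query to the right year:

-- Each member table carries a CHECK constraint on the partitioning column, e.g.:
-- ALTER TABLE dbo.tblSales2008 ADD CONSTRAINT CK_Sales2008_Year CHECK (SalesYear = 2008);

CREATE VIEW dbo.vwSales
AS
SELECT * FROM dbo.tblSales2007
UNION ALL
SELECT * FROM dbo.tblSales2008
UNION ALL
SELECT * FROM dbo.tblSales2009;

-- Each member table can then be partitioned locally on a different key (e.g. region or product).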
The hack option is to create an arbitrary hash of 2 columns and store this per record, and partition by it. You would have to generate this hash for every query/insertion etc., since the partition key cannot be computed; it must be a stored value. It's a hack, and I suspect it would lose more performance than you would gain.
You do have to be thinking of specific management issues/DR over data quantities though: if the data volumes are very large and you are accessing them in a primarily read mechanism, then you should look into SQL 'Madison', which will scale enormously in both number of rows and overall size of data. But it really only suits the 99.9%-read type of data warehouse; it is not suitable for OLTP.
I have production data sets sitting in the 'billions' bracket, and they reside on partitioned table systems and provide very good performance - although much of this is based on the hardware underlying a system, not the database itself. Scaling up to this level is not an issue, and I know of others who have gone well beyond those quantities as well.
The max partitions per table remains at 1,000. From what I remember of a conversation about this, it was a figure set by the testing performed - not a figure in place due to a technical limitation.
