Delta Lake - How does it decide how to auto tune?

I'm reading about delta lake optimization, and more specifically about file size tuning.
In "Autotune based on table size", it said that "Databricks does not autotune tables that you have tuned with a specific target size or based on a workload with frequent rewrites.", and since there is autotune by workload, I was wondering how delta decides how to autotune the table.
Thanks!

Related

What decides the number of partitions in a DynamoDB table?

I'm a beginner to DynamoDB, my online instructor doesn't answer questions in his Q&A lol, and I've been confused about this.
I know that the partition key decides the partition in which the item will be placed.
I also know that the number of partitions is calculated based on throughput or storage using the famous formulas.
So let's say a table has user_id as its partition key, with 200 user_ids. Does that automatically mean that we have 200 partitions? If so, why didn't we calculate the number of partitions based on the famous formulas?
Thanks
Let's establish 2 things.
A DynamoDB partition can support 3,000 read operations and 1,000 write operations per second. It keeps a divider between read and write ops so they do not interfere with each other. If you had a table configured to support 18,000 reads and 6,000 writes, you'd have at least 12 partitions, but probably a few more for some headroom.
A provisioned capacity table has 1 partition by default, but an on-demand table has 4 partitions by default.
So, to answer your question directly: just because you have 200 items does not mean you have 200 partitions. It is very possible for those 200 items to be in just one partition if your table is in provisioned capacity mode. If the configuration of the table changes or it takes on more traffic, those items might move around to new partitions.
There are a few distinct times where DynamoDB will add partitions.
When a partition grows larger than 10 GB of storage. DynamoDB might see that you are taking on data and split proactively, but 10 GB is the cutoff.
When your table needs to support more operations per second than it currently does. This can happen manually because you configured your table to support 20,000 reads/sec where before it only supported 2,000; DynamoDB would have to add partitions and move data to be able to handle that 20,000 reads/sec. Or it can happen automatically because you configured floor and ceiling values in DynamoDB auto-scaling, and DynamoDB senses your ops/sec is climbing and will therefore adjust the number of partitions in response to capacity exceptions.
Your table is in on-demand capacity mode, where DynamoDB attempts to automatically keep 2x your previous high-water mark of capacity at the ready. For example, say your table just reached 10,000 RCU for the first time. DynamoDB would see that this is past your previous high-water mark and start adding more partitions as it tries to keep 2x that capacity ready in case you peak again like you just did.
DynamoDB is actively monitoring your table, and if it sees that one or more items are being hit particularly hard (hot keys) and sit in the same partition, this might create a hot partition. If that is happening, DynamoDB might split the partition to help isolate those items and prevent or fix a hot-partition situation.
There are one or two other more rare edge cases, but you'd likely be talking to AWS Support if you encountered this.
Note: Once DynamoDB creates partitions, the number of partitions never shrinks and this is ok. Throughput dilution is no longer a thing in DynamoDB.
The partition key value is hashed to determine the actual partition to place the data item into.
Thus the number of distinct partition key values has zero effect on the number of physical partitions.
The only things that affect the physical number of partitions are RCUs/WCUs (throughput) and the amount of data stored.
Number of partitions by throughput: Pt = RCU/3000 + WCU/1000
Number of partitions by storage: Ps = GB/10
Unless one of the above is more than 1.0, there will likely only be a single partition. But I'm sure the split happens as you approach the limits; exactly when is something only AWS knows.
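To make that concrete with made-up numbers (purely illustrative), a provisioned table with 9,000 RCU, 2,000 WCU and 25 GB of data would work out to roughly:

Pt = 9000/3000 + 2000/1000 = 3 + 2 = 5
Ps = 25/10 = 2.5, rounded up to 3

So you would expect on the order of 5 partitions here; the larger of the two estimates is generally understood to be the one that governs, though AWS does not publish the exact rule.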

Does Snowflake support indexes?

In the Snowflake documentation, I could not find a reference to using Indexes.
Does Snowflake support Indexes and, if not, what is the alternative approach to performance tuning when using Snowflake?
Snowflake does not use indexes. This is one of the things that makes Snowflake scale so well for arbitrary queries. Instead, Snowflake calculates statistics about the columns and records in the files that you load, and uses those statistics to figure out which parts of which tables/records to actually load to execute a query. It also uses a columnar store file format that lets it read only the parts of the table that contain the fields (columns) you actually use, and thus cut down on I/O for columns that you don't use in the query.
Snowflake slices big tables (gigabyte, terabyte or larger) into smaller "micro-partitions." For each micro-partition, it collects statistics about what value ranges each column contains. Then it only loads the micro-partitions that contain values in the range needed by your query. As an example, let's say you have a column of time stamps. If your query asks for data between June 1 and July 1, then partitions that do not contain any data in this range will not be loaded or processed, based on the statistics stored for dates in the micro-partition files.
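As a hedged illustration with hypothetical table and column names, a filter like the one below can be satisfied by reading only the micro-partitions whose stored min/max statistics for event_ts overlap the requested range:

-- hypothetical "events" table with a timestamp column "event_ts"
select user_id, event_ts, amount
from events
where event_ts >= '2019-06-01'
  and event_ts <  '2019-07-01';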
Indexes are often used for online transaction processing, because they accelerate workflows when you work with one or a few records, but when you run analytics queries on large datasets, you almost always work with large subsets of each table in your joins and aggregates. The storage mechanism, with automatic statistics, automatically accelerates such large queries, with no need for you to specify an index, or tune any kind of parameters.
Snowflake does not support indexes, though it does support "clustering" for performance improvements of I/O.
I recommend reading these links to get familiar with this:
https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html
https://docs.snowflake.net/manuals/user-guide/tables-auto-reclustering.html
Here's a really good blog post on the topic as well:
https://www.snowflake.com/blog/automatic-query-optimization-no-tuning/
Hope this helps...Rich
No, Snowflake does not have indexes. Its performance boosts come from eliminating unnecessary scanning, which it achieves by maintaining rich metadata in each of its micro-partitions. For instance, if you have a time filter in your query and your table is more or less sorted by time, then Snowflake can "prune" away the parts of the table that are not relevant to the query.
Having said this, Snowflake is constantly releasing new features, and one such feature is its Search Optimisation Service, which allows you to perform "needle in a haystack" queries on selected columns that you enable. Not quite indexes that you can create, but something like that being used behind the scenes perhaps.
No, Snowflake doesn't support indexes. And don't let them tell you that this is an advantage.
Performance tuning can be done as described above, but often it is done with money: pay for bigger warehouses.
Snowflake doesn't support indexes; it keeps data in micro-partitions, or in other words it breaks data sets into small files, converts rows to a columnar format, and compresses them. Snowflake's metadata manager in the services layer holds all the information about each micro-partition, such as which partition contains which data.
Each partition has information about itself in its header, like max value, min value, cardinality, etc. This works much better than the indexes of conventional databases.
Snowflake is a columnar database with automatic micro-partitioning. Note that in SQL Server, Microsoft calls their columnar storage option a columnstore index.
The performance gain from columnar storage on data warehouse/mart type queries is spectacular compared with its row-store brethren. By storing data by column, the columns can be greatly compressed, allowing a huge amount of data to be held in memory.
If your predominant queries are on a naturally ordered column, such as OrderDate then it makes sense to cluster on OrderDate. You will gain a performance benefit from doing that.
Clustering isn't a catch-all performance boost. Choose your clustering unwisely and you can degrade performance for your queries.
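For reference, and assuming a hypothetical orders table, defining such a clustering key is a single statement; Snowflake then maintains the clustering automatically:

-- hypothetical table and column names
alter table orders cluster by (OrderDate);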
In terms of performance tuning there are techniques you can use.
When using a dimensional model look at the most commonly used aspects of those dimensions and look to denormalise those aspects into your fact tables to reduce the number of joins.
For example, if the queries use Week, Month and Quarter, then denormalise those aspects into the fact table that is giving you performance concerns. The effect on storage in a column-store DB is far less than in a row-store DB, so the cost/benefit balance is much better.
Materialised views are another way of performance tuning; however, these come with caveats (see the sketch below).
The range of SQL statements available to you for materialised views is far less than for other views
Not all aggregates are supported
Can only be on a single table
They work well when data doesn't change often.
If your underlying table is clustered on OrderDate then a materialised view of last months orders might not give you the desired performance benefit because partition pruning might already be doing what is needed.
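A minimal sketch of such a materialised view, using hypothetical table and column names and keeping to a single source table with simple aggregates per the caveats above:

-- hypothetical fact table; single source table and simple aggregates only
create materialized view mv_monthly_sales as
select OrderMonth,
       sum(SalesAmount) as total_sales,
       count(*)         as order_count
from FactSales
group by OrderMonth;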
If your query performance is as a result of contention with other users then spinning up another warehouse might be the answer. 2 warehouses dedicated to their tasks might be more cost effective than scaling up a single warehouse.
Primary/unique key constraints can be defined but are metadata only despite the constraint documentation describing the enforced/not enforced syntax.
Some distributed column stores do support PK and FK constraints, Vertica being an example, but most do not because the performance impact of enforcing them is too high.
** Updated Fall 2022 - thanks to Hobo's comment: Yes, via Unistore's Hybrid Tables. **
Original Response:
Neither Snowflake nor any high-performance big data / OLAP system will support [unique] indexes because these systems are MPP (Massively Parallel Processing). MPP systems load data with thousands of concurrent inserts into the same table. [Unique] Indexes are a concept from much smaller / OLTP systems. Even then many data engineers intentionally disable the [unique] indexes on OLTP systems when they approach big data scale especially as the data is inserted or frequently updated and deleted.
If you want a "non-unique index" then you can use a slew of features such as: micro-partitions, clustered tables, auto-clustering, Search Optimization Service, etc.
This Medium article can give you some workarounds: How can we enforce [Unique, Primary Key, Foreign Key (UPF)] column constraints in Snowflake?
Snowflake does not support indexing natively, but it has other ways to tune performance:
Reduce queuing by setting a time-out and/or adjusting the max concurrency (see the sketch after this list)
Use result caching
Tackle disk spilling
Rectify row expansion by using the distinct clause, using temporary tables and checking your join order
Fix inadequate pruning by setting up data clustering
Reference: https://rockset.com/blog/what-do-i-do-when-my-snowflake-query-is-slow-part-2-solutions/ (Disclosure: I work for Rockset).
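As a hedged sketch of that first point, with a hypothetical warehouse name, both knobs are ordinary warehouse parameters:

alter warehouse my_wh set statement_queued_timeout_in_seconds = 60;  -- fail fast instead of queueing indefinitely
alter warehouse my_wh set max_concurrency_level = 12;                -- let more queries share the warehouse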
In short, Snowflake does not support indexes, but it does support a single clustering key on each table.
Snowflake does not support indexes, but if you are looking for optimization you can use Snowflake's Search Optimization Service.
Please refer below snowflake documentation.
https://docs.snowflake.com/en/user-guide/search-optimization-service.html
Snowflake's Search Optimization Service will create indexes over all the pertinent columns in a table "out of the box", as well as other advanced search features (e.g. substring and regex matching).
If you'd like to optimize for specific expressions used in your queries, you can customize SOS as well.
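For concreteness, enabling the service is a one-line ALTER; the table and column names here are hypothetical, and the ON clause for targeting specific columns is a newer option that may not be available in every edition:

alter table customers add search optimization;                     -- whole table
alter table customers add search optimization on equality(email);  -- a specific column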

How big is too big size for table in AWS Redshift

Currently, one of our tables is 500 million rows (with 35 columns), and we are trying to determine how big our table can get before it impacts the performance of queries run against it.
Performance cannot be measured like rows*columns.
It depends on the data types, joins, aggregations, etc. Your query performance can be drastically improved, for example, by creating int keys (adding columns) instead of char/varchar keys if used in joins.
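As a hedged illustration of that last point, with hypothetical table and column names: give the dimension an integer surrogate key and join on that instead of a wide varchar value:

-- joining on a narrow int surrogate key is cheaper than joining on varchar business keys
select f.order_id, d.customer_name
from fact_orders f
join dim_customer d
  on f.customer_sk = d.customer_sk;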
An important addition to #vtuhtan 's answer: enable compression. Create tables with compression enabled for the various data types - lzo, runlength, etc. A proper compression type is also suggested by Redshift for existing tables with the ANALYZE COMPRESSION SQL command. This reduces the amount of data read from disk and drastically increases your query performance. It will also make the table consume less storage space.
Doc on analyzing compression enabled tables
Loading tables with compression.
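A hedged sketch of both suggestions, with hypothetical table and column names; ANALYZE COMPRESSION reports a recommended encoding per column, and encodings can also be declared at CREATE TABLE time:

analyze compression big_table;

create table big_table_compressed (
  event_id   bigint       encode az64,
  status     char(1)      encode runlength,
  event_type varchar(32)  encode lzo,
  created_at timestamp    encode az64
);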

Low cost way to host a large table yet keep the performance scalable?

I have a growing table storing time series data, 500M entries now, and 200K new records every day. The total size is around 15GB for now.
My clients are querying the table via a PHP script mostly, and the size of the result set is around 10K records (not very large).
select * from T where timestamp > X and timestamp < Y and additionFilters
And I want this operation cheap.
Currently my table is hosted in Postgres 7, on a single box with 16 GB of memory, and I would love to see some good suggestions for hosting this at low cost while still allowing me to scale up for performance if needed.
The table serves:
1. Query: 90%
2. Insert: 9.9%
3. Update: 0.1% <-- very rare.
PostgreSQL 9.2 supports partitioning and partial indexes. If there are a few hot partitions, and you can put those partitions or their indexes on a solid state disk, you should be able to run rings around your current configuration.
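A hedged sketch of the partial-index idea against the query shape above, with a hypothetical cutoff date; the index stays small because it covers only the recent rows most queries touch:

-- partial index on the "hot" recent portion of the time series
create index t_recent_ts_idx
    on T (timestamp)
    where timestamp > '2014-01-01';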
There may or may not be a low cost, scalable option. It depends on what low cost and scalable mean to you.

Is it possible to partition more than one way at a time in SQL Server?

I'm considering various ways to partition my data in SQL Server. One approach I'm looking at is to partition a particular huge table into 8 partitions, and then within each of these partitions to partition on a different partition column. Is this even possible in SQL Server, or am I limited to defining one partition column + function + scheme per table?
I'm interested in the more general answer, but this strategy is one I'm considering for a Distributed Partitioned View, where I'd partition the data under the first scheme, using the DPV to distribute the huge amount of data over 8 machines, and then on each machine partition that portion of the full table on another partition key in order to be able to drop (for example) sub-partitions as required.
You are incorrect that the partitioning key cannot be computed. Use a computed, persisted column for the key:
ALTER TABLE MYTABLE ADD PartitionID AS ISNULL(Column1 * Column2,0) persisted
I do it all the time, very simple.
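A hedged sketch of the surrounding plumbing, with hypothetical names and boundary values; the persisted computed column can then be used like any other partitioning column:

CREATE PARTITION FUNCTION pfMyTable (int)
    AS RANGE RIGHT FOR VALUES (100, 200, 300);

CREATE PARTITION SCHEME psMyTable
    AS PARTITION pfMyTable ALL TO ([PRIMARY]);

-- rebuild (or create) the clustered index on the scheme, keyed by the computed column
CREATE CLUSTERED INDEX cix_MyTable ON MYTABLE (PartitionID) ON psMyTable (PartitionID);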
The DPV across a set of partitioned tables is your only clean option to achieve this, something like a DPV across tblSales2007, tblSales2008, tblSales2009, where each of the respective sales tables is partitioned again, but could then be partitioned by a different key. There are some very good benefits to doing this in terms of operational resilience (one partitioned table going offline does not take the DPV down - it can still satisfy queries for the other timelines).
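A hedged sketch of the view layer over those tables; each member table carries a CHECK constraint on the partitioning column (e.g. the sales year) so the view can eliminate members, and in a true distributed setup the members live on different servers and are referenced by four-part linked-server names:

CREATE VIEW vwSales
AS
SELECT * FROM tblSales2007
UNION ALL
SELECT * FROM tblSales2008
UNION ALL
SELECT * FROM tblSales2009;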
The hack option is to create an arbitrary hash of 2 columns and store this per record, and partition by it. You would have to generate this hash for every query / insertion etc., since the partition key cannot be computed; it must be a stored value. It's a hack, and I suspect you would lose more performance than you would gain.
You do have to be thinking about specific management issues / DR over these data quantities though. If the data volumes are very large and you are accessing them in a primarily read mechanism, then you should look into SQL 'Madison', which will scale enormously in both number of rows and overall size of data. But it really only suits the 99.9%-read type of data warehouse; it is not suitable for OLTP.
I have production data sets sitting in the 'billions' bracket, and they reside on partitioned table systems and provide very good performance - although much of this is based on the hardware underlying the system, not the database itself. Scaling up to this level is not an issue, and I know of others who have gone well beyond those quantities as well.
The max partitions per table remains at 1000; from what I remember of a conversation about this, it was a figure set by the testing performed - not a figure in place due to a technical limitation.
