Clustering Average depth of table vs partition - snowflake-cloud-data-platform

select system$clustering_depth('Table1','(Column1)');
It gives the average clustering depth of the table for the specified columns; in my case the value is 17501.1143, which tells me that this table is badly clustered.
select SYSTEM$CLUSTERING_INFORMATION('Table1','(Column1)');
This returns the average overlap depth of each micro-partition in the table; in my case the value is 16033, which also tells me that the table is badly clustered.
Question 1: Per the Snowflake documentation, the first value (17501.1143) is for the table and the second value (16033) is for a partition.
Which one should we consider when analysing clustering for Table1?
Question 2: Theoretically, do they both represent the same thing? If so, why do they have different values?

Question #1: Both. You want both of those numbers to get as close to 1 as possible for your clustering key. If you are specifying a column on the table that isn't actually the cluster key, then it'll likely be badly clustered unless things were loaded in order of that column.
Question #2: I highly recommend reading this portion of the Snowflake Documentation to understand the difference between overlap and depth. It shows a nice illustration. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html#clustering-depth
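As a rough sketch (not something from the question itself), both figures can be read in one call, since SYSTEM$CLUSTERING_INFORMATION returns a JSON string; the field names below are my understanding of the documented output format, so check the page linked above:

with ci as (
    select parse_json(system$clustering_information('Table1', '(Column1)')) as info
)
select info:average_depth::float    as average_depth,
       info:average_overlaps::float as average_overlaps
from ci;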

Related

Indices on composite primary key

I have googled this many times but haven't found an exact explanation.
I am working on complex database structures (in Oracle 10g) where I rarely have a primary key on a single column, except for the static tables.
Now my question: consider a composite primary key (LXI, VCODE, IVID, GHID). Since it's a primary key, Oracle will provide a default index.
Will I get ONE (system generated) single index for the primary key itself or for its sub-columns also?
I am asking because I also retrieve data (millions of records) based on the individual columns. If the system generates indices for the individual columns as well, why does my query run noticeably faster when I explicitly define indices for each individual column?
Please give a satisfactory answer
Thanks in advance
A primary key is a non-NULL unique key. In your case, the unique index has four columns: LXI, VCODE, IVID, GHID, in the order of declaration.
If you have a condition on VCODE but not on LXI, then most databases would not use the index. Oracle has a special type of index scan called the "skip scan", which allows for this very situation. It is described in the documentation.
I would expect an index skip scan to be a bit slower than an index range scan on individual columns. However, which is better might also depend on the complexity of the where clause. For instance, three equality conditions on VCODE, IVID and GHID connected by AND might be a great example for the skip scan. And, such an index would cover the WHERE clause -- a great efficiency -- and better than one-column indexes.
As a note: index skip scans were introduced in Oracle 9i, so they are available in Oracle 10.
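If you want to confirm which access path is actually chosen, a minimal sketch (using a hypothetical table name, since the question does not give one) is to check the plan of a query that filters on everything except LXI and look for INDEX SKIP SCAN:

explain plan for
  select *
  from   big_table
  where  vcode = :vcode
  and    ivid  = :ivid
  and    ghid  = :ghid;

select * from table(dbms_xplan.display);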
It will not generate an index for each individual column; it will generate one composite index.
The index is built on LXI first, then on the next column, and so on, in a tree structure.
If you search on the first column of the primary key, the index will be used; to use the index for the second column, you have to combine it with the first column.
ex: select ... where LXI=? will use the PK index
select ... where LXI=? and VCODE=? will also use the PK index
but select ... where VCODE=? will not use it (without LXI)

Not Getting Partition Elimination on a Foreign Key Join in SQL Server

I have a rather large fact table in SQL Server that is partitioned by a foreign key to a date dimension. The foreign key constraint is both enabled and trusted. When I add something like this to the where clause:
"F_ClinicInvoiceTransaction".ServiceDateKey>=40908 and "F_ClinicInvoiceTransaction".ServiceDateKey<42247
I get partition elimination. However when I simply join on the ServiceDateKey and filter on a date range as such:
"D_Calendar"."CalendarKey" ="F_ClinicInvoiceTransaction"."ServiceDateKey"
AND "D_Calendar".StartDT>='2012-01-01' and "D_Calendar".StartDT<'2015-10-01'
The partition elimination goes away. Is there a way to get partition elimination based on this join or am I stuck filtering explicitly on values in the fact table?
It really is much easier to answer these questions when you give more details -- but I will try and answer as best I can:
Perform a sub-query with your 2nd filter ("D_Calendar".StartDT>='2012-01-01' and "D_Calendar".StartDT<'2015-10-01' ) and just get Min and Max values of ServiceDateKey.
Use the min and max values of ServiceDateKey to perform your full query; with those values in the where clause, like your first query, it can do partition elimination.
While it seems like doing these two steps would be slower, partition elimination often gives faster results, especially with big data sets.
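A hedged sketch of those two steps, using the table and column names from the question (the local variables are just for illustration):

declare @minKey int, @maxKey int;

select @minKey = min(CalendarKey),
       @maxKey = max(CalendarKey)
from   D_Calendar
where  StartDT >= '2012-01-01' and StartDT < '2015-10-01';

select f.*
from   F_ClinicInvoiceTransaction f
       join D_Calendar d on d.CalendarKey = f.ServiceDateKey
where  f.ServiceDateKey >= @minKey   -- filtering on the partitioned column itself
and    f.ServiceDateKey <= @maxKey;  -- is what enables partition elimination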
No, you cannot gain partition elimination by filtering on a column in another table. The filter needs to be on the actual partitioned column in the table and needs to be in the where clause. Found the answer here: https://dba.stackexchange.com/questions/21770

Joining on a non-PK field, does length of varchar datatype determine query speed? SQL Server 2008

I was given a ragtag assortment of data to analyze and am running into a predicament. I've got a ~2 million row table with a non-unique identifier of datatype varchar(50). This identifier is unique to a personID. Until I figure out exactly how I need to normalize this junk, I've got another question that might help me right now: if I change the datatype to varchar(25), for instance, will that help queries run faster when they're joined on a non-PK field? All of the characters in the string are digits, but trying to convert them to an int would cause overflow. Or could I somehow index the column for the time being to get some of the queries to run faster?
EDIT: The personID will be a foreign key to another table with demographic information about a person.
Technically, the length of a varchar specifies its maximum length.
The actual length is variable (hence the name), so a lower maximum length won't change the comparison, because it is made on the actual string value.
For more information, check this MSDN article and this Stack Overflow post.
Varchar(50) to varchar(25) would certainly reduce the size of records in that table, thereby reducing the number of database pages that contain the table and improving the performance of queries (maybe to a marginal extent), but such an ALTER TABLE statement might take a long time.
Alternatively, if you define an index on the join columns and your retrieval list is small, you can also include those columns in the index definition (a covering index); that too would bring down query execution times significantly.
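As a rough sketch of such a covering index, assuming a hypothetical fact table PersonFact whose varchar(50) PersonID is the join column and VisitDate/Amount are the retrieved columns (none of these names come from the question):

create nonclustered index IX_PersonFact_PersonID
    on dbo.PersonFact (PersonID)
    include (VisitDate, Amount);   -- covering the retrieval list avoids key lookups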

Difference between clustered and nonclustered index [duplicate]

I need to add proper index to my tables and need some help.
I'm confused and need to clarify a few points:
Should I use an index for non-int columns? Why/why not?
I've read a lot about clustered and non-clustered index yet I still can't decide when to use one over the other. A good example would help me and a lot of other developers.
I know that I shouldn't use indexes for columns or tables that are often updated. What else should I be careful about and how can I know that it is all good before going to test phase?
A clustered index alters the way that the rows are stored. When you create a clustered index on a column (or a number of columns), SQL server sorts the table’s rows by that column(s). It is like a dictionary, where all words are sorted in alphabetical order in the entire book.
A non-clustered index, on the other hand, does not alter the way the rows are stored in the table. It creates a completely different object within the table that contains the column(s) selected for indexing and a pointer back to the table’s rows containing the data. It is like an index in the last pages of a book, where keywords are sorted and contain the page number to the material of the book for faster reference.
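To illustrate, a minimal sketch of the two kinds of index on a hypothetical dbo.Employee table (the table and column names are assumptions, not taken from the question):

create clustered index IX_Employee_EmployeeID
    on dbo.Employee (EmployeeID);   -- physically orders the table's rows by EmployeeID

create nonclustered index IX_Employee_LastName
    on dbo.Employee (LastName);     -- separate structure with pointers back to the rows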
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way!
One rule of thumb I would apply is this: any "regular" table (one that you use to store data in, that is a lookup table, etc.) should have a clustering key. There's really no point in not having a clustering key. Actually, contrary to common belief, having a clustering key speeds up all the common operations - even inserts and deletes (since the table organization is different and usually better than with a heap - a table without a clustering key).
Kimberly Tripp, the Queen of Indexing, has a great many excellent articles on the topic of why to have a clustering key, and what kind of columns to best use as your clustering key. Since you only get one per table, it's of utmost importance to pick the right clustering key - and not just any clustering key.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!
Marc
You should be using indexes to help SQL Server performance. Usually that implies that columns used to find rows in a table are indexed.
Clustered indexes make SQL Server order the rows on disk according to the index order. This implies that if you access data in the order of the clustered index, the data will be present on disk in the correct order. However, if the column(s) with a clustered index are frequently changed, the row(s) will move around on disk, causing overhead - which generally is not a good idea.
Having many indexes is not good either. They are costly to maintain. So start out with the obvious ones, and then profile to see which ones you are missing and would benefit from. You do not need them from the start; they can be added later on.
Most column datatypes can be used when indexing, but it is better to have small columns indexed than large. Also it is common to create indexes on groups of columns (e.g. country + city + street).
Also, you will not notice performance issues until you have quite a bit of data in your tables. Another thing to think about is that SQL Server needs statistics to do its query optimizations the right way, so make sure you generate them.
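For example, a hedged sketch of keeping statistics current (dbo.Orders is a hypothetical table name):

exec sp_updatestats;                          -- refresh statistics across the database
update statistics dbo.Orders with fullscan;   -- or target one table explicitly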
A comparison of a non-clustered index with a clustered index with an example
As an example of a non-clustered index, let's say that we have a non-clustered index on the EmployeeID column. A non-clustered index will store both the value of the EmployeeID AND a pointer to the row in the Employee table where that value is actually stored. But a clustered index, on the other hand, will actually store the row data for a particular EmployeeID - so if you are running a query that looks for an EmployeeID of 15, the data from other columns in the table like EmployeeName, EmployeeAddress, etc. will all actually be stored in the leaf node of the clustered index itself.
This means that with a non-clustered index extra work is required to follow that pointer to the row in the table to retrieve any other desired values, as opposed to a clustered index which can just access the row directly since it is being stored in the same order as the clustered index itself. So, reading from a clustered index is generally faster than reading from a non-clustered index.
In general, use an index on a column that's going to be used (a lot) to search the table, such as a primary key (which by default has a clustered index). For example, if you have the query (in pseudocode)
SELECT * FROM FOO WHERE FOO.BAR = 2
You might want to put an index on FOO.BAR. A clustered index should be used on a column that will be used for sorting. A clustered index is used to sort the rows on disk, so you can only have one per table. For example if you have the query
SELECT * FROM FOO ORDER BY FOO.BAR ASCENDING
You might want to consider a clustered index on FOO.BAR.
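As a sketch, the indexes those two example queries suggest would look something like this (FOO and BAR are the placeholder names already used above; only one clustered index is allowed per table, so you would pick whichever fits your access pattern):

create nonclustered index IX_FOO_BAR on FOO (BAR);   -- supports WHERE FOO.BAR = 2
create clustered index CIX_FOO_BAR on FOO (BAR);     -- supports ORDER BY FOO.BAR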
Probably the most important consideration is how much time your queries are taking. If a query doesn't take much time or isn't used very often, it may not be worth adding indexes. As always, profile first, then optimize. SQL Server Management Studio can give you suggestions on where to optimize, and MSDN has some information that you might find useful.
Clustered index: faster to read than a non-clustered index, as the data is physically sorted in index order; we can create only one per table.
Non-clustered index: quicker for insert and update operations than a clustered index; we can create any number of them per table.

Approaches to table partitioning in SQL Server

The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect statistics of usage (the queries run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partition scheme that best balances data size and gives the best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (the 'table' itself is effectively one of the indexes - its clustered index or heap), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates, obviously (there is not much sense in partitioning just a non-clustered index and not partitioning the clustered one), so unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I'd venture a guess I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
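For illustration, a minimal sketch of a sliding-window style setup, assuming the Case table's Year column were chosen as the partitioning column (the boundary values are made up):

create partition function pfCaseYear (int)
    as range right for values (2008, 2009, 2010, 2011, 2012);

create partition scheme psCaseYear
    as partition pfCaseYear all to ([PRIMARY]);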
If you have no other choice, you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on your UniqueIdentifier you might need to start manually allocating ids).
When performing a query, you will need to run the same query on all tables and use UNION to merge the result sets into a single result.
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected queries, but it's better than hitting the size limit of a table.
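A hedged sketch of that manual approach: a view that unions the ten modulo tables so queries can still address the data as one object (the table names follow the Case00..Case09 naming above):

create view dbo.CaseAll
as
select * from dbo.Case00 union all
select * from dbo.Case01 union all
-- ... Case02 through Case08 ...
select * from dbo.Case09;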
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps that could improve performance through different normalization, denormalization, or partial-normalization choices? Are there options to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.
