Partiton scanning increases if the functions like UPPER is used - snowflake-cloud-data-platform

Hi I noticed that the partition scanning increases if we use functions like UPPER in the where clause. Although it is not required to use UPPER, but I wanted to know why it changes the behavior of partition scanning. It seems like if functions are used, it forces a lot more scanning of the partitions
SELECT *
FROM SCOPS_DB.TABLE1
WHERE YEAR =2015 and UPPER(COL1)='COLVAL';
SCANS 18,759 PARTITIONS
SELECT *
FROM SCOPS_DB.TABLE1
WHERE YEAR =2015 and COL1='COLVAL';
SCANS 1 PARTITION
Thanks
Rajib

Snowflake stores metadata about each column in a metadata services layer including: "The range of values for each of the columns in the micro-partition". To me, this is like having an index on each column pointing to the the corresponding micro-partitions in your table.
In your example, this range of values stored in the metadata layer for COL1 is stored it its raw form (probably a mix of upper and lower case). This means when you apply a function to the column, the metadata services layer cannot be used as an "index" to fetch the micro-partitions for the range of rows you need.
If you need to, you can create a cluster key to the column which does apply a function. In which case it'll be used when you apply that same function as a predicate.

Related

Altering/Editing an already partitioned table

What steps to take to add additional partitions to the end of an already partitioned table in SQL Server?
Conditions:
The Partition Function is Right Range.
Table considers as a VLTB.
No DB downtime is acceptable (<10min).
Also, How to verify the partitions and rows are correctly mapped?
Addressing your questions in turn:
What steps to take to add additional partitions to the end of an already partitioned table in SQL Server?
Partitioned tables are built on partition schemes which themselves are built on partition functions. Partition functions explicitly specify partition boundaries which implicitly define the partitions. To add a new partition to the table, you need to alter the partition function to add a new partition boundary. The syntax for that is alter partition function... split. For example, let's say that you have an existing partition function on a datetime data type that defines monthly partitions.
CREATE PARTITION FUNCTION PF_Monthly(datetime)
AS RANGE RIGHT FOR VALUES (
'2022-10-01',
'2022-11-01',
'2022-12-01',
'2023-01-01'
);
Pausing there and talking about the last two partitions in the current setup. The next-to-last partition is defined as 2022-12-01 <= x < 2023-01-01 while the last partition is defined as 2023-01-01 <= x. Which is to say that the next-to-last partition is bounded for the month of December 2022, the last partition is unbounded on the high side and includes data for January 2023 but also anything larger.
If you want to bound the last partition to just January 2023, you'll add a partition boundary to the function for the high side of that partition. There's a small catch in that you'll also need to alter the partition scheme to tell SQL where to put data, but that's a small thing.
ALTER PARTITION SCHEME PS_Monthly
NEXT USED someFileGroup;
ALTER PARTITION FUNCTION PF_Monthly()
SPLIT RANGE ('2023-02-01');
At this point, what used to be your highest partition is now defined as 2023-01-01 <= x < 2023-02-01 and the highest partition is defined as 2023-02-01 <= x. I should note that adding a boundary to a partition function will affect all tables that use it. When I was using table partitioning at a previous job, I had a rule to have only one table using a given partition function (even if they were logically equivalent).
No DB downtime is acceptable (<10min)
The above exposition doesn't mention one important point - if there is data in either side of the new boundary, a new B-tree is going to be built for it (which is a size-of-data operation). There's more on that in the documentation. To keep that at a minimum, I like to keep two empty partitions at the end of the scheme. Using my above example, that would mean that I'd have added the January partition boundary in November. By doing it this way, you have some leeway in when the actual partition split happens (i.e. if it's a bit late, you're not accidentally incurring data movement). I'd also put in monitoring that's something along the lines of "if the highest partition boundary is less than 45 days away, alert". A slightly more sophisticated but more correct alert would be "if there is data in the second to last partition, send an alert".
Also, How to verify the partitions and rows are correctly mapped?
You can query the DMVs for this. I like using the script in this blog post. There's also the $PARTITION() function if you want to see which partition specific rows in your table belong to.

Is local bitmap index better or normal index, if table partition is not used in query

I have a partitioned table with a column of enumerated values (i.e. non unique), and I want to make the index on this column to enhance the performance of a certain query that doesn't include the partition in the where clause (i.e. runs on the entire table) .... is it better to make it local bitmap index or a normal index ? ... I am using oracle 12g
A bitmap index is better for a non-unique values like age,sex,location etc.
But it very well depends on the volume of data,size of DB and how frequent the updates are etc.
Below will be a good read.
Refer: http://www.oracle.com/technetwork/articles/sharma-indexes-093638.html

Is there any benefit to an index when the column is only used for (hash) joining?

Lets say that a column will only be used for joining. (i.e. I won't be ordering on the column, nor will a search for specific values in the column individually) ... the only thing that I will use the column for is joining to another table.
If the database supports Hash Joins (which from my understanding don't benefit from indexes) .. then wouldn't the addition of an index be completely redundant? (and wasteful) ?
In SQL Server it will still prevent a Key Lookup.
If you JOIN on an unindexed field, the server needs to get the values for that field from the clustered index.
If you JOIN on a NC index, the values can be obtained directly without loading all the data pages from the cluster (which really is the whole table).
So essentially you save yourself a lot of IO as the first step filters down based on a very narrow index instead of on the entire table loaded from disk.

mysql 5.1 partitioning - do I have to remove the index/key element?

I have a table with several indexes. All of them contain an specific integer column.
I'm moving to mysql 5.1 and about to partition the table by this column.
Do I still have to keep this column as key in my indexes or I can remove it since partitioning will take care of searching only in the relevant keys data efficiently without need to specify it as key?
Partition field must be part of index so the answer is that I kave to keep the partitioning column in my index.
Partitioning will only slice the values/ranges of that index into separate partitions according to how you set it up. You'd still want to have indexes on that column so the index can be used after partition pruning has been done.
Keep in mind there's a big impact on how many partitions you can have, if you have an integer column with only 4 distinct values in it, you might create 4 partitions, and an index would likely not benefit you much depending on your queries.
If you got 10000 distinct values in your integer column, you hit system limits if you try to create 10k partitions - you'll have to partition on large ranges (e.g. 0-1000,1001-2000, etc.) in such a case you'll benefit from an index (again depending on how you query the tables)

Database optimization: Hashing all the values

Typically, the databases are designed as below to allow multiple types for an entity.
Entity Name
Type
Additional info
Entity name can be something like account number and type could be like savings,current etc in a bank database for example.
Mostly, type will be some kind of string. There could be additional information associated with an entity type.
Normally queries will be posed like this.
Find account numbers of this particular type?
Find account numbers of type X, having balance greater than 1 million?
To answer these queries, query analyzer will scan the index if the index is associated with a particular column. Otherwise, it will do a full scan of all the rows.
I am thinking about the below optimization.
Why not we store the hash or integral value of each column data in the actual table such that the ordering property is maintained, so that it will be easy for comparison.
It has below advantages.
1. Table size will be lot less because we will be storing small size values for each column data.
2. We can construct a clustered B+ tree index on the hash values for each column to retrieve the corresponding rows matching or greater or smaller than some value.
3. The corresponding values can be easily retrieved by having B+ tree index in the main memory and retrieving the corresponding values.
4. Infrequent values will never need to retrieved.
I am still having more optimizations in my mind. I will post those based on the feedback to this question.
I am not sure if this is already implemented in database, this is just a thought.
Thank you for reading this.
-- Bala
Update:
I am not trying to emulate what the database does. Normally indexes are created by the database administrator. I am trying to propose a physical schema by having indexes on all the fields in the database, so that database table size is reduced and its easy to answer few queries.
Updates:(Joe's answer)
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
In a typical table, all the physical data will be stored. But now by generating a hash value on each column data, I am only storing the hash value in the actual table. I agree that its not reducing the size of the database, but its reducing the size of the table. It will be useful when you don't need to return all the column values.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
There can be only one clustered index on a table and all other indexes have to unclustered indexes. With my approach I will be having clustered index on all the values of the database. It will improve query performance.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
The basic idea is that instead of creating a separate table for each column for efficient access, we are doing it at the physical level.
Now the table will look like this.
Row1 - OrderedHash(Column1),OrderedHash(Column2),OrderedHash(Column3)
Google for "hash index". For example, in SQL Server such an index is created and queried using the CHECKSUM function.
This is mainly useful when you need to index a column which contains long values, e.g. varchars which are on average more than 100 characters or something like that.
How does adding indexes to every field reduce the size of the database? You still have to store all of the true values in addition to the hash; we don't just want to query for existence but want to return the actual data.
Most RDBMSes answer most queries efficiently now (especially with key indices in place). I'm having a hard time formulating scenarios where your database would be more efficient and save space.
Putting indexes within the physical data -- that doesn't really make sense. The key to indexes' performance is that each index is stored in sorted order. How do you propose doing that across any possible field if they are only stored once in their physical layout? Ultimately, the actual rows have to be sorted by something (in SQL Server, for example, this is the clustered index)?
I don't think your approach is very helpful.
Hash values only help for equality/inequality comparisons, but not less than/greater than comparisons, compared to pretty much every database index.
Even with (in)equality hash functions do not offer 100% guarantee of having given you the right answer, as hash collisions can happen, so you will still have to fetch and compare the original value - boom, you just lost what you wanted to save.
You can have the rows in a table ordered only one way at a time. So if you have an application where you have to order rows differently in different queries (e.g. query A needs a list of customers ordered by their name, query B needs a list of customers ordered by their sales volume), one of those queries will have to access the table out-of-order.
If you don't want the database to have to work around colums you do not use in a query, then use indexes with extra data columns - if your query is ordered according to that index, and your query only uses columns that are in the index (coulmns the index is based on plus columns you have explicitly added into the index), the DBMS will not read the original table.
Etc.

Resources