Snowflake automatic clustering - snowflake-cloud-data-platform

I have a table with 12 billion rows. I first loaded the data and then added a clustering key. Now that manual clustering is deprecated, how long will it take to automatically recluster the whole table?

Duration cannot be forecasted, but you can check the progress.
https://docs.snowflake.com/en/sql-reference/functions/automatic_clustering_history#syntax
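For example (MY_TABLE is a hypothetical table name and the 7-day window is arbitrary), something along these lines shows how much work Automatic Clustering has done and how well clustered the table currently is:

-- Credits used and rows/bytes reclustered by Automatic Clustering over the last 7 days
SELECT *
FROM TABLE(INFORMATION_SCHEMA.AUTOMATIC_CLUSTERING_HISTORY(
    DATE_RANGE_START => DATEADD('day', -7, CURRENT_TIMESTAMP()),
    TABLE_NAME       => 'MY_TABLE'));

-- Returns JSON including the average clustering depth; it should trend
-- downward as reclustering progresses
SELECT SYSTEM$CLUSTERING_INFORMATION('MY_TABLE');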
From the Snowflake docs,
You can use SQL to view whether Automatic Clustering is enabled for a table:
SHOW TABLES command.
TABLES view (in the Snowflake Information Schema).
TABLES view (in the Account Usage shared database).
The AUTO_CLUSTERING_ON column in the output displays the Automatic Clustering status for each table, which can be used to determine whether to suspend or resume Automatic Clustering for a given table.
In addition, the CLUSTER_BY column (SHOW TABLES) or CLUSTERING_KEY column (TABLES view) displays the column(s) defined as the clustering key(s) for each table.
https://docs.snowflake.com/en/user-guide/tables-auto-reclustering#viewing-the-automatic-clustering-status-for-a-table
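A sketch of those status checks for a single table (again assuming a hypothetical MY_TABLE):

SHOW TABLES LIKE 'MY_TABLE';

SELECT TABLE_NAME, AUTO_CLUSTERING_ON, CLUSTERING_KEY
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = 'MY_TABLE';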

Related

Setting Table Auto Clustering On in snowflake is not clustering the table

I moved from manual clustering to auto clustering around two weeks back.
The steps I used are below.
Update AUTO_CLUSTERING_ON to yes for the table.
Create a middle (staging) table and insert the records into it.
Then insert into the main table from the middle table, with ORDER BY on the clustering key.
After that, I see the clustering is all over the place.
I once did manual clustering as well and saw the clustering doing well.
However, on the next insert into the main table, the clustering again looks troublesome.
Please suggest if I am missing anything.
Please note:
The data loaded into the middle table is inserted from some other table as well, and that table is never clustered. I am not sure if that is the issue (which I feel it should not be).
You may need to raise a case with Snowflake to enable automatic clustering. Accounts that were created a while ago won't have this enabled. From the documentation:
If manual reclustering is still available in your account, Automatic Clustering may not be enabled yet for your account.
You can request Automatic Clustering to be enabled for your account; however, it will only affect clustered tables that are defined from the time after the feature is enabled.
For clustered tables that were defined before the feature is enabled, you must explicitly “resume” Automatic Clustering for each table. You can use SQL to determine whether Automatic Clustering is enabled for a given table.
Also, from the documentation here, you should try to run the resume recluster command, since the table may have been created prior to Automatic Clustering being enabled for your account:
alter table t1 resume recluster;
Don't forget that the table gets reclustered automatically at Snowflake's discretion. Snowflake may simply not think the table requires reclustering, based on a number of factors (which I don't know :))
I think raising a case with Snowflake will probably solve this pretty quickly so that may be the best route.
Not specifically related to the question, but I have found that periodically rebuilding a table will achieve the best clustering results, especially for tables which churn frequently. To do this you can specify an ORDER BY clause which mimics your clustering keys.
CREATE OR REPLACE TABLE t1 COPY GRANTS AS
SELECT * FROM t1 ORDER BY a, b, c;

Cassandra: Why do I not have to include all partition keys in query?

Currently, I am dealing with Cassandra.
While reading a blog post, I came across the following:
When issuing a CQL query, you must include all partition key columns,
at a minimum.
(https://shermandigital.com/blog/designing-a-cassandra-data-model/)
However, in my database it seems to be possible without including all partition key columns. Here is the table:
CREATE TABLE usertable (
personid text,
name text,
"timestamp" timestamp,
active boolean,
PRIMARY KEY ((personid, name), timestamp)
) WITH
CLUSTERING ORDER BY ("timestamp" DESC)
AND comment=''
AND read_repair_chance=0
AND dclocal_read_repair_chance=0.1
AND gc_grace_seconds=864000
AND bloom_filter_fp_chance=0.01
AND compaction={ 'class':'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold':'32',
'min_threshold':'4' }
AND compression={ 'chunk_length_in_kb':'64',
'class':'org.apache.cassandra.io.compress.LZ4Compressor' }
AND caching={ 'keys':'ALL',
'rows_per_partition':'NONE' }
AND default_time_to_live=0
AND id='23ff16b0-c400-11e8-55c7-2b453518a213'
AND min_index_interval=128
AND max_index_interval=2048
AND memtable_flush_period_in_ms=0
AND speculative_retry='99PERCENTILE';
So I can do select * from usertable where personid = 'ABC-02';. However, according to the blog post, I have to include timestamp as well.
Can someone explain this?
In Cassandra, the partition key spreads data around the cluster. Cassandra computes the hash of the partition key to determine the location of the data in the cluster.
One exception: if you use ALLOW FILTERING or a secondary index, you are not required to include all partition key columns in the WHERE clause.
For further information, take a look at the blog post:
The purpose of a partition key is to split the data into partitions
where an entire partition is stored on a single node in the cluster
(with each node storing many partitions). When data is read or written
from the cluster, a function called Partitioner is used to compute the
hash value of the partition key. This hash value is used to determine
the node/partition which contains that row. The clustering key is used
further to search for a row within a given partition.
Select queries in Apache Cassandra look a lot like select queries from
a relational database. However, they are significantly more
restricted. The attributes allowed in ‘where’ clause of Cassandra
query must include the full partition key and additional clauses may
only reference the clustering key columns or a secondary index of the
table being queried.
Requiring the partition key attributes in the ‘where’ helps Cassandra
to maintain constant result-set retrieval time as the cluster is
scaled-out by allowing Cassandra to determine the partition, and thus
the node (and even data files on disk), that the query must be
directed to.
If a query does not specify the values for all the columns from the
primary key in the ‘where’ clause, Cassandra will not execute it and
give the following warning :
‘InvalidRequest: Error from server: code=2200 [Invalid query]
message=”Cannot execute this query as it might involve data filtering
and thus may have unpredictable performance. If you want to execute
this query despite the performance unpredictability, use ALLOW
FILTERING” ‘
https://www.instaclustr.com/apache-cassandra-scalability-allow-filtering-partition-keys/
https://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
According to your schema, your timestamp column is the clustering column (the sorting column), not part of the partition key. That's why it is not required.
(personid, name) are your partition key columns.
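To make the distinction concrete, here is a rough sketch against the usertable schema above (the name value and the date are made up):

-- Full partition key (personid, name): Cassandra can hash it straight to a partition
SELECT * FROM usertable WHERE personid = 'ABC-02' AND name = 'some-name';

-- Full partition key plus a range on the clustering column is also fine
SELECT * FROM usertable
WHERE personid = 'ABC-02' AND name = 'some-name' AND "timestamp" > '2018-01-01';

-- Only part of the partition key: normally rejected; it runs with ALLOW FILTERING
-- (on recent Cassandra versions) or a secondary index, at the cost of
-- filtering across the cluster
SELECT * FROM usertable WHERE personid = 'ABC-02' ALLOW FILTERING;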

Database normalization for electricity monitoring system

I've read a lot of tips and tutorials about normalization but I still find it hard to understand how and when we need normalization. So right now I need to know if this database design for an electricity monitoring system needs to be normalized or not.
So far I have one table with fields:
monitor_id
appliance_name
brand
ampere
uptime
power_kWh
price_kWh
status (ON/OFF)
This monitoring system monitors multiple appliances (TV, Fridge, washing machine) separately.
So does it need to be normalized further? If so, how?
Honestly, you can get away without normalizing every database. Normalization is good if the database is going to be a project that affects many people, or if there are performance issues and the database handles OLTP workloads. Database normalization in many ways boils down to having more tables, each with fewer columns. Denormalization involves having fewer tables with more columns.
I've never seen a real database with only one table, but that's ok. Some people denormalize their database for reporting purposes. So it isn't always necessary to normalize a database.
How do you normalize it? You need to have a primary key (on a column that is unique or a combination of two or more columns that are unique in their combined form). You would need to create another table and have a foreign key relationship. A foreign key relationship is a pair of columns that exist in two or more tables. These columns need to share the same data type. These act as a map from one table to another. The tables are usually separated by real-world purpose.
For example, you could have a table with status, uptime and monitor_id. This would have a foreign key relationship on monitor_id back to the original table, which could then drop the uptime and status columns. You could have a third table with brands, models and the attributes that all units of a model have in common (e.g., power_kWh, ampere, etc.). There could be a foreign key relationship to the first table based on model, and the brand column could then be eliminated (via the DDL command DROP) from the first table, since this third table carries it, keyed by the model name.
To create new tables, you'll need to invoke the DDL command CREATE TABLE newTable with a foreign key on the column that will in effect be shared by the new table and the original table. With foreign key constraints, the tables share a column. The tables will have less information in them (fewer columns) when they are highly normalized, but there will be more tables to accommodate and store all the data. This way you can update one table without taking a lock that covers all the other columns, as happens in a denormalized database with one big table.
Once the new tables hold the data from the columns of the original table, you can drop those columns from the original table (except for the foreign key column). To drop columns, you invoke DDL commands (e.g., ALTER TABLE originalTable DROP COLUMN brand).
In many ways, performance will improve if you do many reads and writes (commit many transactions) against a normalized database. If you use the table for reporting and want to present all the data exactly as it sits in one big table, normalizing the database will hurt performance.
By the way, normalizing the database can prevent redundant data. This can make the database consume less storage space and use less memory.
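As a rough sketch of the kind of split described above (table names such as appliance_model and monitor_status, the data types, and the exact grouping are illustrative assumptions, not the only correct design):

-- Attributes shared by every unit of a given model
CREATE TABLE appliance_model (
    model     VARCHAR(100) PRIMARY KEY,
    brand     VARCHAR(100),
    ampere    DECIMAL(6,2),
    power_kWh DECIMAL(10,3)
);

-- The original table, trimmed down; brand/ampere/power now live in appliance_model
CREATE TABLE monitor (
    monitor_id     INT PRIMARY KEY,
    appliance_name VARCHAR(100),
    model          VARCHAR(100),
    price_kWh      DECIMAL(10,4),
    FOREIGN KEY (model) REFERENCES appliance_model (model)
);

-- Status and uptime, keyed back to the monitor (assumes one current row per monitor)
CREATE TABLE monitor_status (
    monitor_id INT PRIMARY KEY,
    status     VARCHAR(3),   -- 'ON' / 'OFF'
    uptime     INT,
    FOREIGN KEY (monitor_id) REFERENCES monitor (monitor_id)
);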
It is nice to have your database normalized. It helps keep the data efficient because redundancy is prevented, which also saves storage. When normalizing tables, we need a primary key in each table and use it to connect to other tables; when the primary key (unique in its own table) appears in another table, it is called a foreign key (used to connect the tables).
For example, say you already have this table:
Table name : appliances_tbl
-inside here you have
-appliance_id : as the primary key
-appliance_name
-brand
-model
and so on about this appliances...
Next you have another table :
Table name : appliance_info_tbl (any table name works, but it should relate to its fields)
-appliance_info_id : primary key
-appliance_price
-appliance_uptime
-appliance_description
-appliance_id : foreign key (so you can get the name of the appliance by using only its id)
and so on....
You can add more tables like that; just make sure you have a primary key in each table. You can also note the cardinality to make your normalization easier to understand.
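A minimal DDL sketch of the two tables described above (the data types are assumptions):

CREATE TABLE appliances_tbl (
    appliance_id   INT PRIMARY KEY,
    appliance_name VARCHAR(100),
    brand          VARCHAR(100),
    model          VARCHAR(100)
);

CREATE TABLE appliance_info_tbl (
    appliance_info_id     INT PRIMARY KEY,
    appliance_price       DECIMAL(10,2),
    appliance_uptime      INT,
    appliance_description VARCHAR(255),
    appliance_id          INT,
    FOREIGN KEY (appliance_id) REFERENCES appliances_tbl (appliance_id)
);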

Create a unique ID in Oracle database

Is it preferred to always create a unique ID in an Oracle database? I know that the Oracle database automatically creates a field ROWID which is unique, but if I reorganize the database due to performance issues, won't all my ROWIDs be changed/restructured, so that the IDs I have used in other applications will be broken?
When using any relational database you should create unique artificial keys if your data does not naturally contain a value or combination of values which is unique. An Oracle ROWID encodes a physical address in the database and should never be stored in a table. A ROWID is only guaranteed to be constant for the duration of a single transaction; various types of operations on the database, including partition key updates, Flashback Table operations, and shrink table operations, can cause it to change. See here for further information.
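As a sketch of such an artificial key (the table and column names are made up; the identity column assumes Oracle 12c or later, on older versions a SEQUENCE serves the same purpose):

CREATE TABLE customers (
    customer_id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    email       VARCHAR2(255) NOT NULL,
    CONSTRAINT customers_email_uk UNIQUE (email)  -- natural uniqueness, if one exists
);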
Share and enjoy.

SQL Server (2005) - "Deleted On" DATETIME and Indexing

I have a question related to database design. The database that I'm working with requires data to be treated in such a way that it is never physically deleted. We started going down the path of adding a "DeleteDateTime" column to some tables, which is NULL by default but, once stamped, marks a record as deleted.
This gives us the ability to archive our data easily, but I still feel in the dark on a few areas, specifically whether this would be considered in line with best practices, and also how to go about indexing these tables efficiently.
I'll give you an example: We have a table called "Courses" with a composite primary key made up of the columns "SiteID" and "CourseID".
This table also has a column called "DeleteDateTime" that is used in accordance with my description above.
I can't use the SQL Server 2008 filtered index feature because we have to be SQL Server 2005 compatible. Should I include "DeleteDateTime" in the clustered index for this table? If so, should it be the first column in the index (i.e. "DeleteDateTime, SiteID, CourseID")?
Does anyone have any reasons why I should or shouldn't follow this approach?
Thanks!
Is there a chance you could transfer those "dead" records into a separate table? E.g. for your Courses table, have a Courses_deleted table or something like that, with an identical structure.
When you "delete" a record, you basically just move it to the "dead table". That way, the index on your actual, current data stays small and zippy....
If you need to have an aggregate view, you can always define a Courses_View which unions the two tables together.
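A sketch of what that might look like, assuming Courses_deleted mirrors Courses exactly and listing only the columns mentioned in the question:

CREATE VIEW Courses_View AS
    SELECT SiteID, CourseID, DeleteDateTime FROM Courses
    UNION ALL
    SELECT SiteID, CourseID, DeleteDateTime FROM Courses_deleted;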
Your clustered index on your real table should be as small, static and constant as possible, so I would definitely NOT recommend putting such a datetime column into it. Not a good idea.
For excellent info on how to choose a good clustering key, and what it takes, check out Kimberly Tripp's blog entries:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
What are your requirements on data retention? Have you looked into an audit log instead of keeping all non-current data in the database?
I think you have it right on the head for the composite indexes including your "DeleteDateTime" column.
I would create a view that is basically:
SELECT SiteID, CourseID   -- list all columns except the delete column
FROM Courses
WHERE DeleteDateTime IS NULL;
This is what I would use for all my queries on the table. The reason is to prevent people from forgetting to consider the deleted flag. SQL Server 2005 can easily handle this kind of view, and it is necessary if you are going to use this design for deleting records. I would have a separate index on the DeleteDateTime column; I likely would not make it part of the clustered index.
