Setting Table Auto Clustering On in snowflake is not clustering the table - snowflake-cloud-data-platform

I moved from manual clustering to auto clustering around two weeks back.
The steps I used are below:
Set AUTO_CLUSTERING_ON to yes for the table.
Create a middle (staging) table and insert the records into it.
Insert into the main table from the middle table, ordering by the clustering key.
After this I see the clustering is all over the place.
I also ran a manual recluster once and the clustering looked good; however, after the next insert into the main table the clustering looks troublesome again.
Please suggest if I am missing anything.
Please note:
The data loaded into the middle table is itself inserted from some other table, and that table is never clustered. I am not sure if that is the issue (though I feel it should not be).
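For reference, a rough sketch of the load steps described above (all table and column names are placeholders, not the actual objects):
-- hypothetical names: source_table, middle_table, main_table, cluster_key
INSERT INTO middle_table
SELECT * FROM source_table;          -- load the middle/staging table

INSERT INTO main_table
SELECT * FROM middle_table
ORDER BY cluster_key;                -- insert pre-sorted on the clustering key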

You may need to raise a case with Snowflake to enable automatic clustering. Accounts that were created a while ago won't have this enabled. From the documentation:
If manual reclustering is still available in your account, Automatic Clustering may not be enabled yet for your account.
You can request Automatic Clustering to be enabled for your account; however, it will only affect clustered tables that are defined from the time after the feature is enabled.
For clustered tables that were defined before the feature is enabled, you must explicitly “resume” Automatic Clustering for each table. You can use SQL to determine whether Automatic Clustering is enabled for a given table.
Also, per the documentation, you should try to run the resume recluster command, since the table may have been created prior to automatic clustering being enabled for your account:
alter table t1 resume recluster;
Don't forget that the table gets reclustered automatically at Snowflake's discretion. Snowflake may simply not think the table requires reclustering, based on a number of factors (which I don't know :))
I think raising a case with Snowflake will probably solve this pretty quickly so that may be the best route.
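If it helps, you can also check the table's current state in SQL. A sketch, assuming a table named T1 with a clustering key on COL1 (adjust the names to your table):
SHOW TABLES LIKE 'T1';                                   -- the automatic_clustering column shows ON/OFF
SELECT SYSTEM$CLUSTERING_INFORMATION('T1', '(COL1)');    -- reports clustering depth/overlap for the key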

Not specifically related to the question, but I have found that periodically rebuilding a table will achieve the best clustering results, especially for tables which churn frequently. To do this you can specify an ORDER BY clause which mimics your clustering keys.
CREATE OR REPLACE TABLE t1 COPY GRANTS AS
SELECT * FROM t1 ORDER BY a, b, c;

Related

Two identity range constraints on subscriber table - SQL Server replication

We have transactional replication with updatable subscriptions. At the subscriber we have several tables with a double identity range constraint. (I'm not sure how to reproduce this; maybe it happened during one of the reinitializations or re-creations of the replication.)
For instance:
CHECK NOT FOR REPLICATION (([ID]>(513000) AND [ID]<(514000)))
CHECK NOT FOR REPLICATION (([ID]>(347934) AND [ID]<(360000)))
DBCC CHECKIDENT result:
Checking identity information: current identity value 'NULL', current column value '538185'.
Replication works as intended, but we want to get rid of the excessive constraint. I have no idea why the current identity is NULL here. I know that we can reseed the identity so it falls within a range, but how do we determine which of the two constraints is the valid, current one for replication?
For some tables it is not an issue and the current identity is within one of the two ranges, but that raises another question: how can we safely remove the excessive constraint?
I believe we could remove the article from the replication, verify all constraints are removed from the table, then put the article back and reinitialize all subscriptions. But reinitializing isn't really a good solution for us, because it would take too much time and it may harm our customers.
If we just deleted one of the constraints, would it do any harm to the replication? Is information about the constraints saved in some system tables that could cause trouble in the future?
Any ideas about neat solution?
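For reference, a sketch of how to list a table's check constraints (including the replication-managed identity range checks) from the catalog views; the table name is a placeholder:
SELECT cc.name, cc.definition, cc.is_not_for_replication
FROM sys.check_constraints AS cc
WHERE cc.parent_object_id = OBJECT_ID('dbo.MyTable');    -- hypothetical table name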

(SQL Server) How to disable foreign key validation during a single INSERT query?

I'm trying to improve the performance of a multi-row INSERT query, and the biggest factor at the moment according to the query plan is a FK validation against a large parent table.
I know that the INSERT query will not be inserting data that violates the FK, because it is an INSERT INTO ... SELECT ... FROM query where the SELECT involves an INNER JOIN to the parent table on the key columns, so it's not possible that invalid values will be present in the inserted rows.
I do not want to disable the FK globally. I don't want to open a window in which other queries could potentially insert bad data, and locking the table and disabling the FK before performing the INSERT doesn't help, because re-enabling the FK after the INSERT implies revalidating all the rows (WITH CHECK) before the engine will trust the FK, and both tables are large (potentially tens of millions of rows, and it's a multi-column natural key).
Is there any way in MSSQL to disable the validation of a specific foreign key just during the scope of a single INSERT query? I'm sincerely hoping (without much hope[1]) that I've just missed the documentation where that option is explained.
[1] Why would the engine trust the user to not use that option on a query that might insert bad data? It seems like that would be little more than syntactic sugar for the LOCK TABLE - DISABLE FK - INSERT - ENABLE FK - UNLOCK TABLE approach. But I have to ask just in case...
Sometimes it's the best solution; usually not, though. There's no way to do it other than to disable the constraint before and re-enable it afterwards:
ALTER TABLE foo NOCHECK CONSTRAINT CK_foo_column
Then, afterwards:
ALTER TABLE foo CHECK CONSTRAINT CK_foo_column
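Note that re-enabling it this way leaves the constraint untrusted. If you want SQL Server to revalidate existing rows and mark the constraint as trusted again (which is exactly the full-table check the question is trying to avoid), the variant is:
ALTER TABLE foo WITH CHECK CHECK CONSTRAINT CK_foo_column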
You can, but it's a terrible, terrible, terrible decision. And there is no free lunch. At some point, you must validate the constraint - pay now or pay later. And pay later will eventually mean that your assumption (all the rows are valid) will be proved false.
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/disable-indexes-and-constraints
As @Randeep mentions, SQL Server does not automatically create an index to support a FK. And this can't be done for a single statement or a single connection - it is global to all users and to the particular table.
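If the FK lookup is the bottleneck, a supporting index on the referencing column is usually the first thing to try; a sketch with placeholder table/column names:
CREATE INDEX IX_ChildTable_ParentId ON dbo.ChildTable (ParentId);   -- hypothetical names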

Table in DB for generating primary keys?

Do you ever use a separate table for "generating" artificial primary keys for DB (and why)? What I mean is to have a table with two columns, table name and current ID - with which you could get new "ID" for some table by simply locking the row with that table name, getting the current value of the key, increment it by one, and unlock the row. Why would you prefer this over standard integer identity column?
P.S. The "idea" is from Fowlers Patterns of Enterprise Application Architecture, btw...
This is called Hi/Lo assignment.
You would do this with a trigger on INSERT on your tables that gets the ID from the key table and increments it, either before or after you fetch your ID, depending on your choice.
This is commonly used when you have to deal with multiple database engines. The auto-incrementing identifier in Oracle is done through a SEQUENCE, which you increment with SEQUENCE.NEXTVAL from within a BEFORE INSERT trigger on your data table.
By contrast, SQL Server has IDENTITY columns, which auto-increment natively and are managed by the database engine itself.
In order for your software to work on both engines, you have to settle on some sort of standard, and the most common "standard" used for this is Hi/Lo assignment of the primary key.
This is one approach among others. These days, with ORM mapping tools such as NHibernate, it is offered through configuration, so you have less to worry about on both the application and database sides.
EDIT #1
Because this kind of manoeuvre can't be used at a global scope, you'd have to have such a table per database, or per database schema. That way, each schema is independent of the others. However, data in one schema can't implicitly be moved to another with the same key, as it might conflict with an already existing row.
As for a security schema, it accesses the same database as another schema or user, so no additional table should be needed for a specific security schema.
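A minimal sketch of what such a key table could look like (names are illustrative only):
CREATE TABLE KeyTable (
    TableName VARCHAR(128) NOT NULL PRIMARY KEY,   -- one row per table being keyed
    NextId    INT          NOT NULL                -- next available ID for that table
);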
Whenever you can use sql server's identity or guid features, you should. However, there are a few situations where this may not be possible.
One example is that sql server only allows one identity column per table. Rarely, a table will have records that need both a private id and a public id, and a limit of one identity column means generating both as integers can be a pain. You could always use a guid for one, but you want the integer on the private id for speed and you may also want the public id to be more human readable than a guid.
In this situation, an extra table for generating the ids can make sense. However, I'd do it a bit differently. Still have two columns in the table, but make one "shadow" or "Id mapping" table for every real table. One of the columns will be your private id (unique constraint) and one will be your public id (identity with maybe an increment value of '7' or '13' or other number that's less obvious than '1').
The key difference here is that you don't want to do the locking yourself. Let sql server handle it.
The only time I have ever used this is when I had an application in Btrieve, and it didn't have an identity column. And I should also say that when they used this table, it caused a massive slowdown when they tried to import data, because of all the extra reads and writes. My friend looked at it and rewrote how they did it to speed it up, but the moral of the story is that if you do something like this incorrectly, there can be brutal consequences.
Personally, I don't think I would ever want to do this. There is too much possibility for error. Two people try and use the same key, because they forgot to lock the table before grabbing the id. This just seems like something that should be left up to the RDBMS if at all possible. As Will brought up, it's easy to minimize this situation, but if you don't know what you are doing it can happen.
You wouldn't prefer it at all.
Whatever you gain by using the pattern or becoming DB agnostic, you'll lose in headaches, support and performance.
locking the row with that table name, getting the current value of the key, increment it by one, and unlock the row
This sounds simple, doesn't it?
UPDATE TableOfId
SET Id += 1
OUTPUT Inserted.Id
WHERE Name = @Name;
In reality, it's a disaster. No activity occurs in the application as a standalone operation: all operations are part of transactions. One cannot simply 'unlock' the row, because the 'unlock' will actually occur only at commit time. This means that all transactions that need an Id on a table are serialized, and only one can proceed at any time. It also means that transactions that access more than one table will likely deadlock on updating the table of Ids, because enforcing the 'get the next Id' update order is hard in practice.
To avoid complete serialization one needs to obtain the Ids on separate, standalone, transactions that can commit immediately (usually implicit auto-commit transaction on the UPDATE itself). But this complicates the application logic tremendously. Every operation needs to maintain two separate connections to the database, one to do the normal transaction logic and another one to obtain the needed Ids. Even then, the update of Ids can become such a hot spot that it can still cause visible contention and blocking (similar to the dreaded 'update page hit count +1' prevalent on web apps).
In short: use IDENTITY. The identity generation is optimized for high concurrency.
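A minimal sketch of the IDENTITY approach in SQL Server (table and column names are illustrative):
CREATE TABLE Orders (
    OrderId      INT IDENTITY(1,1) PRIMARY KEY,   -- the engine hands out the IDs, no key table needed
    CustomerName VARCHAR(100) NOT NULL             -- hypothetical payload column
);

INSERT INTO Orders (CustomerName) VALUES ('example');
SELECT SCOPE_IDENTITY();                            -- the ID generated in this scope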
I have seen this pattern used when data created in one database needs to be migrated, backed up, clustered or staged to another database. In this situation, first of all you want to ensure the primary keys will not need to change. Secondly, the foreign keys. Thirdly, externally exposed keys or durable references.

SQL Server (2005) - "Deleted On" DATETIME and Indexing

I have a question related to database design. The database that I'm working with requires data to be treated in such a way that it is never physically deleted. We started going down the path of adding a "DeleteDateTime" column to some tables that is NULL by default but, once stamped, marks a record as deleted.
This gives us the ability to archive our data easily, but I still feel in the dark on a few areas, specifically whether this would be considered in line with best practices, and also how to go about indexing these tables efficiently.
I'll give you an example: we have a table called "Courses" with a composite primary key made up of the columns "SiteID" and "CourseID". This table also has a column called "DeleteDateTime" that is used in accordance with my description above.
I can't use the SQL Server 2008 filtered index feature because we have to be SQL Server 2005 compatible. Should I include "DeleteDateTime" in the clustered index for this table? If so, should it be the first column in the index (i.e. "DeleteDateTime, SiteID, CourseID")?
Does anyone have any reasons why I should or shouldn't follow this approach?
Thanks!
Is there a chance you could transfer those "dead" records into a separate table? E.g. for your Courses table, have a Courses_deleted table or something like that, with an identical structure.
When you "delete" a record, you basically just move it to the "dead table". That way, the index on your actual, current data stays small and zippy....
If you need to have an aggregate view, you can always define a Courses_View which unions the two tables together.
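A sketch of such an aggregate view, assuming the two tables have identical structure (only the key columns from the question are listed here):
CREATE VIEW Courses_View AS
    SELECT SiteID, CourseID, DeleteDateTime FROM Courses
    UNION ALL
    SELECT SiteID, CourseID, DeleteDateTime FROM Courses_deleted;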
Your clustered index on your real table should be as small, static and constant as possible, so I would definitely NOT recommend putting such a datetime column into it. Not a good idea.
For excellent info on how to choose a good clustering key, and what it takes, check out Kimberly Tripp's blog entries:
GUIDs as PRIMARY KEYs and/or the clustering key
The Clustered Index Debate Continues...
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
What are your requirements on data retention? Have you looked into an audit log instead of keeping all non-current data in the database?
I think you have it right on the head for the composite indexes including your "DeleteDateTime" column.
I would create a view that is basically
select {list all columns except the delete column}
from mytable
where DeleteDateTime is null
This is what I would use for all my queries on the table. The reason is to prevent people from forgetting to consider the deleted column. SQL Server 2005 can easily handle this kind of view, and it is necessary if you are going to use this design for deleting records. I would have a separate index on the deleted column. I likely would not make it part of the clustered index.
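A concrete sketch of that view against the Courses table from the question (listing only the key columns for brevity):
CREATE VIEW ActiveCourses AS
    SELECT SiteID, CourseID          -- list every column except DeleteDateTime
    FROM Courses
    WHERE DeleteDateTime IS NULL;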

Tables with no Primary Key

I have several tables whose only unique data is a uniqueidentifier (a Guid) column. Because guids are non-sequential (and they're client-side generated so I can't use newsequentialid()), I have made a non-primary, non-clustered index on this ID field rather than giving the tables a clustered primary key.
I'm wondering what the performance implications are for this approach. I've seen some people suggest that tables should have an auto-incrementing ("identity") int as a clustered primary key even if it doesn't have any meaning, as it means that the database engine itself can use that value to quickly look up a row instead of having to use a bookmark.
My database is merge-replicated across a bunch of servers, so I've shied away from identity int columns as they're a bit hairy to get right in replication.
What are your thoughts? Should tables have primary keys? Or is it ok to not have any clustered indexes if there are no sensible columns to index that way?
When dealing with indexes, you have to determine what your table is going to be used for. If you are primarily inserting 1000 rows a second and not doing any querying, then a clustered index is a hit to performance. If you are doing 1000 queries a second, then not having an index will lead to very bad performance. The best thing to do when trying to tune queries/indexes is to use the Query Plan Analyzer and SQL Profiler in SQL Server. This will show you where you are running into costly table scans or other performance blockers.
As for the GUID vs ID argument, you can find people online that swear by both. I have always been taught to use GUIDs unless I have a really good reason not to. Jeff has a good post that talks about the reasons for using GUIDs: https://blog.codinghorror.com/primary-keys-ids-versus-guids/.
As with most anything development related, if you are looking to improve performance there is not one, single right answer. It really depends on what you are trying to accomplish and how you are implementing the solution. The only true answer is to test, test, and test again against performance metrics to ensure that you are meeting your goals.
[Edit]
@Matt, after doing some more research on the GUID/ID debate I came across this post. Like I mentioned before, there is no true right or wrong answer. It depends on your specific implementation needs. But these are some pretty valid reasons to use GUIDs as the primary key:
For example, there is an issue known as a "hotspot", where certain pages of data in a table are under relatively high concurrency contention. Basically, what happens is that most of the traffic on a table (and hence page-level locks) occurs on a small area of the table, towards the end. New records will always go to this hotspot, because IDENTITY is a sequential number generator. These inserts are troublesome because they require an exclusive page lock on the page they are added to (the hotspot). This effectively serializes all inserts to a table thanks to the page locking mechanism. NewID() on the other hand does not suffer from hotspots. Values generated using the NewID() function are only sequential for short bursts of inserts (where the function is being called very quickly, such as during a multi-row insert), which causes the inserted rows to spread randomly throughout the table's data pages instead of all at the end - thus eliminating a hotspot from inserts.
Also, because the inserts are randomly distributed, the chance of page splits is greatly reduced. While a page split here and there isn't too bad, the effects do add up quickly. With IDENTITY, page Fill Factor is pretty useless as a tuning mechanism and might as well be set to 100% - rows will never be inserted in any page but the last one. With NewID(), you can actually make use of Fill Factor as a performance-enabling tool. You can set Fill Factor to a level that approximates estimated volume growth between index rebuilds, and then schedule the rebuilds during off-peak hours using DBCC DBREINDEX. This effectively delays the performance hits of page splits until off-peak times.
If you even think you might need to enable replication for the table in question, then you might as well make the PK a uniqueidentifier and flag the guid field as ROWGUIDCOL. Replication will require a uniquely valued guid field with this attribute, and it will add one if none exists. If a suitable field exists, then it will just use the one that's there.
Yet another huge benefit of using GUIDs for PKs is the fact that the value is indeed guaranteed unique - not just among all values generated by this server, but among all values generated by all computers - whether it be your db server, web server, app server, or client machine. Pretty much every modern language has the capability of generating a valid guid now - in .NET you can use System.Guid.NewGuid. This is VERY handy when dealing with cached master-detail datasets in particular. You don't have to employ crazy temporary keying schemes just to relate your records together before they are committed. You just fetch a perfectly valid new Guid from the operating system for each new record's permanent key value at the time the record is created.
http://forums.asp.net/t/264350.aspx
The primary key serves three purposes:
indicates that the column(s) should be unique
indicates that the column(s) should be non-null
documents the intent that this is the unique identifier of the row
The first two can be specified in lots of ways, as you have already done.
The third reason is good:
for humans, so they can easily see your intent
for the computer, so a program that might compare or otherwise process your table can query the database for the table's primary key.
A primary key doesn't have to be an auto-incrementing number field, so I would say that it's a good idea to specify your guid column as the primary key.
Just jumping in, because Matt's baited me a bit.
You need to understand that although a clustered index is put on the primary key of a table by default, the two concepts are separate and should be considered separately. A CIX indicates the way that the data is stored and referred to by NCIXs, whereas the PK provides uniqueness for each row to satisfy the LOGICAL requirements of a table.
A table without a CIX is just a Heap. A table without a PK is often considered "not a table". It's best to get an understanding of both the PK and CIX concepts separately so that you can make sensible decisions in database design.
Rob
Nobody answered the actual question: what are the pluses/minuses of a table with NO PK NOR a CLUSTERED index?
In my opinion, if you optimize for faster inserts (especially incremental bulk inserts, e.g. when you bulk load data into a non-empty table), such a table - with NO clustered index, NO constraints, NO foreign keys, NO defaults and NO primary key, in a database with the Simple recovery model - is the best. Now, if you ever want to query this table (as opposed to scanning it in its entirety) you may want to add non-clustered, non-unique indexes as needed, but keep them to a minimum.
I too have always heard having an auto-incrementing int is good for performance even if you don't actually use it.
A Primary Key needn't be an autoincrementing field, in many cases this just means you are complicating your table structure.
Instead, a Primary Key should be the minimum collection of attributes (note that most DBMS will allow a composite primary key) that uniquely identifies a tuple.
In technical terms, it should be the field that every other field in the tuple is fully functionally dependent upon. (If it isn't you might need to normalise).
In practice, performance issues may mean that you merge tables, and use an incrementing field, but I seem to recall something about premature optimisation being evil...
Since you are doing replication, you are correct that identities are something to steer clear of. I would make your GUID a primary key, but nonclustered, since you can't use newsequentialid(). That strikes me as your best course. If you don't make it a PK but just put a unique index on it, sooner or later that may cause people who maintain the system to not understand the FK relationships properly, introducing bugs.
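A sketch of that approach (table and column names are placeholders; ROWGUIDCOL is only needed if the table takes part in merge replication):
CREATE TABLE MyTable (
    RowGuid UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL DEFAULT NEWID(),   -- client-generated guids can be inserted explicitly
    CONSTRAINT PK_MyTable PRIMARY KEY NONCLUSTERED (RowGuid)
);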
