Can the clustering keys be defined on temporary and transient tables? - snowflake-cloud-data-platform

I understand that clustering can be defined on a permanent table, but can the same be defined on temporary and transient tables?

Have you tried to test it?
create or replace temporary table x_temp (id number, v varchar ) cluster by (id) as select UNIFORM(0,100,random()), 'xyz' from table(generator(rowcount=>10000000));
create or replace transient table x_transient (id number, v varchar ) cluster by (id) as select UNIFORM(0,100,random()), 'xyz' from table(generator(rowcount=>10000000));
select system$clustering_information( 'x_temp');
select system$clustering_information( 'x_transient');
You may define clustering keys on transient and temporary tables. Although it's possible to define a clustering key on a temporary table, I do not think it will provide much benefit, as the automatic clustering service will not be able to re-cluster it.
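If you want to verify this yourself, one way to check (relying on the standard SHOW TABLES output) is to look at the automatic_clustering column, which reports ON or OFF per table:
show tables like 'X_TEMP';       -- inspect the automatic_clustering column
show tables like 'X_TRANSIENT';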
New sample based on the comment:
create or replace transient table x_transient_v2 (id number, v varchar ) cluster by (id)
as select UNIFORM(0,10,random()), RANDSTR( 1000, random()) from table(generator(rowcount=>500000));
select system$clustering_information( 'x_transient_v2'); -- save the output to compare
insert into x_transient_v2 select UNIFORM(0,10,random()), RANDSTR( 1000, random()) from table(generator(rowcount=>50000));
update x_transient_v2 set v = RANDSTR( 1000, random()) where v ilike 'A%';
insert into x_transient_v2 select UNIFORM(0,10,random()), RANDSTR( 1000, random()) from table(generator(rowcount=>50000));
select system$clustering_information( 'x_transient_v2'); -- compare the output with the first one
select *
from table(information_schema.automatic_clustering_history(
date_range_start=>dateadd(d, -1, current_date),
date_range_end=>current_date,
table_name=>'X_TRANSIENT_V2')); -- wait a while until you see a row where NUM_ROWS_RECLUSTERED is not zero
select system$clustering_information( 'x_transient_v2'); -- compare the output, which should be similar to the first

Technically, yes, you can run ALTER TABLE ... CLUSTER BY on a transient or temporary table.
But I'm not sure that the induced cost and expected performance will suit your use case.
You should read the Snowflake documentation on clustered tables, because clustering keys are not intended for all tables; you have to consider table size and query performance.
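As a sketch of what that looks like in practice (using the transient table from the example above), a clustering key can be added, suspended, and dropped after the fact:
alter table x_transient cluster by (id);     -- add or change the clustering key
alter table x_transient suspend recluster;   -- pause Automatic Clustering if the cost is a concern
alter table x_transient resume recluster;    -- turn it back on
alter table x_transient drop clustering key; -- remove the clustering key entirely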

Related

How can a constraint be violated when the SQL query excludes it?

I have a result table that holds the output of a large, complicated, slow running query.
It's defined something like:
create table ResultsStore (
Foo int not null,
Bar int not null,
... other fields
constraint [PK_ResultsStore_foo_bar] primary key clustered
(
Foo asc,
Bar asc
)
)
I then insert into this table with a query like:
insert ResultsStore (Foo, Bar)
output inserted.*
select subQuery.ID, @bar
from (
-- large complex slow query
) subQuery
where subQuery.ID not in (
select Foo
from ResultsStore
where Bar = @bar
)
In testing this is fine, but in production, with lots of users hitting it regularly, we often get an exception:
Violation of PRIMARY KEY constraint 'PK_ResultsStore_foo_bar'. Cannot insert duplicate key in object 'ResultsStore'.
How is this possible? Surely the WHERE clause should exclude any combination of the primary key fields that is already in the table?
How to best avoid this?
As written, two sessions can run the query concurrently, both check for the existence of the row, both fail to find it, and both then attempt the insert. The first one succeeds under READ COMMITTED and the second one fails.
You need WITH (UPDLOCK, HOLDLOCK, ROWLOCK) on the subquery against ResultsStore to avoid this race condition. At the default READ COMMITTED isolation level, either the S locks taken by the subquery are released as soon as each row has been read, or row versioning is used and no locks are taken at all.
HOLDLOCK gives serializable semantics and protects the range; UPDLOCK forces the read to take a U lock, which blocks other sessions from reading with UPDLOCK.
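For illustration, this is roughly where the hints would go in the original statement (a sketch using the table and @bar variable from the question, not a tested drop-in fix):
insert ResultsStore (Foo, Bar)
output inserted.*
select subQuery.ID, @bar
from (
-- large complex slow query
) subQuery
where not exists (
select *
from ResultsStore with (updlock, holdlock, rowlock)
where Foo = subQuery.ID
and Bar = @bar
)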
You can also use a temp table or table variable to hold interim results and perform the final insert at the end.
The following also includes a DISTINCT (which might or might not be needed), changes the dup test to use EXISTS, and applies WITH (UPDLOCK, HOLDLOCK, ROWLOCK) options to the final insert as suggested by others.
declare @TempResults table (
Foo int not null,
Bar int not null
)
insert @TempResults
select distinct subQuery.ID, @bar
from (
-- large complex slow query
) subQuery
insert ResultsStore (Foo, Bar)
output inserted.*
select T.Foo, T.Bar
from @TempResults T
where not exists (
select *
from ResultsStore RS with (updlock, holdlock, rowlock)
where RS.Foo = T.Foo
and RS.Bar = T.Bar
)
This lets your long running query run fast and dirty (as you intend), but should maintain integrity and minimize actual lock duration for the final insert.

(Alembic, SQLAlchemy) Can I copy data from non partitioned key to a partitioned one in the migration script?

I have a table that needs to be partitioned, but since postgresql_partition_by wasn't set when the table was created, I am trying to:
- create a new partitioned table similar to the original one,
- move the data from the old one to the new one,
- drop the original one,
- rename the new one.
So what is the best practice for moving the data from the old table to the new one?
I tried this and it didn't work
COPY partitioned_table
FROM original_table;
also tried
INSERT INTO partitioned_table (column1, column2, ...)
SELECT column1, column2, ...
FROM original_table;
but neither worked :(
Note that I am using Alembic to generate the migration scripts and SQLAlchemy from Python.
Basically you have two scenarios, described below:
- The table is large and you need to split the data into several partitions.
- The table becomes the first partition and you add new partitions for new data.
Let's use this setup for the non-partitioned table:
create table jdbn.non_part
(id int not null, name varchar(100));
insert into jdbn.non_part (id,name)
SELECT id, 'xxxxx'|| id::varchar(20) name
from generate_series(1,1000) id;
The table contains ids from 1 to 1000, and for the first case you need to split them into two partitions of 500 rows each.
Create the partitioned table
with a structure and constraints identical to the original table
create table jdbn.part
(like jdbn.non_part INCLUDING DEFAULTS INCLUDING CONSTRAINTS)
PARTITION BY RANGE (id);
Add partitions
to cover current data
create table jdbn.part_500 partition of jdbn.part
for values from (1) to (501); /* 1 <= id < 501 */
create table jdbn.part_1000 partition of jdbn.part
for values from (501) to (1001);
for future data (as required)
create table jdbn.part_1500 partition of jdbn.part
for values from (1001) to (1501);
Use insert to copy data
Note that this approach copies the data, which means you need twice the space and possibly a cleanup of the old table afterwards.
insert into jdbn.part (id,name)
select id, name from jdbn.non_part;
Check partition pruning
Note that only the partition part_500 is accessed
EXPLAIN SELECT * FROM jdbn.part WHERE id <= 500;
QUERY PLAN |
----------------------------------------------------------------+
Seq Scan on part_500 part (cost=0.00..14.00 rows=107 width=222)|
Filter: (id <= 500) |
Second option - attach the existing table as one partition
If you can live with one (big) initial partition, you may use the second approach.
Create the partitioned table
same as above
Attach the table as a partition
ALTER TABLE jdbn.part ATTACH PARTITION jdbn.non_part
for values from (1) to (1001);
Now the original table becomes the first partition of your partitioned table, i.e. no data duplication is performed.
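If the original table is large, one refinement worth knowing (assuming the id range really is 1 to 1000): adding a matching CHECK constraint before the ATTACH lets PostgreSQL skip the validation scan of the existing rows.
alter table jdbn.non_part
  add constraint non_part_id_check check (id >= 1 and id < 1001);
alter table jdbn.part attach partition jdbn.non_part
  for values from (1) to (1001);
alter table jdbn.non_part drop constraint non_part_id_check; -- no longer needed once attached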
EXPLAIN SELECT * FROM jdbn.part WHERE id <= 500;
QUERY PLAN |
---------------------------------------------------------------+
Seq Scan on non_part part (cost=0.00..18.50 rows=500 width=12)|
Filter: (id <= 500) |
A similar answer, with some hints on automating partition creation, is here.
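As a rough sketch of what such automation could look like (plain PL/pgSQL; bounds chosen to match the 500-row ranges above, adjust as needed):
do $$
begin
  for i in 0..3 loop
    execute format(
      'create table if not exists jdbn.part_%s partition of jdbn.part for values from (%s) to (%s)',
      (i + 1) * 500, i * 500 + 1, (i + 1) * 500 + 1);
  end loop;
end
$$;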
After trying a few things, the solution was:
INSERT INTO new_table(fields ordered as the result of the select statement) SELECT * FROM old_table
I don't know if there was an easier way to get the fields ordered, but I got the column order by generating an insert for a row in DBeaver (the original answer illustrated the steps with screenshots).

Duplicates not getting ignored in SQL Server

I have a temp table that has two rows.
Their Ids are 999359143 and 999365081.
I have a table that doesn't have a primary key but has a unique index based off of the id and date.
The Id 999359143 already exists in the table, so when I run my query it still tries to insert that row from the temp table into the normal table and errors. The query is below:
INSERT INTO [XferTable]
([DataDate]
,[LoanNum]
)
SELECT Distinct t1.[DataDate]
,t1.[LoanNum]
FROM #AllXfers t1 WITH(HOLDLOCK)
WHERE NOT EXISTS(SELECT t2.LoanNum, t2.DataDate
FROM XferTable t2 WITH(HOLDLOCK)
WHERE t2.LoanNum = t1.LoanNum AND t2.DataDate = t1.DataDate
)
Is there a better way to do this?
You should use the MERGE statement, which acts atomically, so you shouldn't need to do your own locking (also, isolation query hints on temporary tables don't achieve anything).
MERGE XferTable AS TARGET
USING #AllXfers AS SOURCE
ON
TARGET.[DataDate] = SOURCE.[DataDate]
AND TARGET.[LoanNum] = SOURCE.[LoanNum]
WHEN NOT MATCHED BY TARGET -- record in SOURCE but not in TARGET
THEN INSERT
(
[DataDate]
,[LoanNum]
)
VALUES
(
SOURCE.[DataDate]
,SOURCE.[LoanNum]
);
Your primary key violation is probably because you are using (Date, Loan#) as the uniqueness criteria and your primary key is probably only on Loan#.
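One hedged way to verify that speculation is to list the key columns of the indexes on the table (standard catalog views; the object name dbo.XferTable is assumed):
SELECT i.name AS index_name, c.name AS column_name, i.is_unique, i.is_primary_key
FROM sys.indexes i
JOIN sys.index_columns ic ON ic.object_id = i.object_id AND ic.index_id = i.index_id
JOIN sys.columns c ON c.object_id = ic.object_id AND c.column_id = ic.column_id
WHERE i.object_id = OBJECT_ID('dbo.XferTable')
ORDER BY i.name, ic.key_ordinal;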

Using Primary Key on Table Variable to improve seek performance

I have this table variable I use in my SP:
DECLARE @t TABLE(ID uniqueidentifier)
Then I insert some data into it (I later use):
INSERT INTO @t(ID)
SELECT ID FROM Categories WHERE ...
And later I have a few SELECT and UPDATE statements based on the @t IDs, e.g.:
SELECT * FROM Categories A INNER JOIN @t T ON A.ID = T.ID
etc..
Should I declare ID uniqueidentifier PRIMARY KEY to increase performance in the SELECT / UPDATE statements?
If yes should it be clustered or non clustered?
What is the advised option in my case?
EDIT: All my tables in the DB have a uniqueidentifier (ID) column as a NONCLUSTERED primary key.
EDIT 2: Strangely (or not), when I tried PRIMARY KEY NONCLUSTERED on the table variable, the execution plan for the joined SELECT showed a Table Scan on @t, but when I omit NONCLUSTERED it shows a Clustered Index Scan.
If you are worried about performance, you should probably not be using a table variable, but a temporary table instead. The problem with table variables is that statements referencing them are compiled when the table is empty, so the query optimiser always assumes there is only one row. This can result in suboptimal plans when the table variable is populated with many more rows.
Regarding the primary key, there are downsides to making it clustered, as that results in the table being physically ordered by the index. The overhead of this sorting may outweigh the performance benefit when querying the data. In general it is better to add a non-clustered index; however, as always, it depends on your particular problem and you will have to test the different implementations.
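A minimal sketch of the variants being compared (names follow the question; which one performs best depends on your data and needs testing):
-- table variable with a primary key (clustered by default on a table variable)
DECLARE @t TABLE (ID uniqueidentifier NOT NULL PRIMARY KEY);
-- table variable with a nonclustered primary key
DECLARE @t2 TABLE (ID uniqueidentifier NOT NULL PRIMARY KEY NONCLUSTERED);
-- temp table alternative, which gives the optimiser real row-count statistics
CREATE TABLE #t (ID uniqueidentifier NOT NULL PRIMARY KEY);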

unique index is not enforced if IsActive column is false [duplicate]

I have a situation where I need to enforce a unique constraint on a set of columns, but only for one value of a column.
So for example I have a table like Table(ID, Name, RecordStatus).
RecordStatus can only have a value 1 or 2 (active or deleted), and I want to create a unique constraint on (ID, RecordStatus) only when RecordStatus = 1, since I don't care if there are multiple deleted records with the same ID.
Apart from writing triggers, can I do that?
I am using SQL Server 2005.
Behold, the filtered index. From the documentation (emphasis mine):
A filtered index is an optimized nonclustered index especially suited to cover queries that select from a well-defined subset of data. It uses a filter predicate to index a portion of rows in the table. A well-designed filtered index can improve query performance as well as reduce index maintenance and storage costs compared with full-table indexes.
And here's an example combining a unique index with a filter predicate:
create unique index MyIndex
on MyTable(ID)
where RecordStatus = 1;
This essentially enforces uniqueness of ID when RecordStatus is 1.
Following the creation of that index, a uniqueness violation will raise an error:
Msg 2601, Level 14, State 1, Line 13
Cannot insert duplicate key row in object 'dbo.MyTable' with unique index 'MyIndex'. The duplicate key value is (9999).
Note: the filtered index was introduced in SQL Server 2008. For earlier versions of SQL Server, please see this answer.
Add a check constraint like this. The difference is that the check fails when RecordStatus = 1 and there is already an active row with the same Id.
http://msdn.microsoft.com/en-us/library/ms188258.aspx
CREATE TABLE CheckConstraint
(
Id TINYINT,
Name VARCHAR(50),
RecordStatus TINYINT
)
GO
CREATE FUNCTION CheckActiveCount(
@Id INT
) RETURNS INT AS BEGIN
DECLARE @ret INT;
SELECT @ret = COUNT(*) FROM CheckConstraint WHERE Id = @Id AND RecordStatus = 1;
RETURN @ret;
END;
GO
ALTER TABLE CheckConstraint
ADD CONSTRAINT CheckActiveCountConstraint CHECK (NOT (dbo.CheckActiveCount(Id) > 1 AND RecordStatus = 1));
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 2);
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 2);
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 2);
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 1);
INSERT INTO CheckConstraint VALUES (2, 'Oh no!', 1);
INSERT INTO CheckConstraint VALUES (2, 'Oh no!', 2);
-- Msg 547, Level 16, State 0, Line 14
-- The INSERT statement conflicted with the CHECK constraint "CheckActiveCountConstraint". The conflict occurred in database "TestSchema", table "dbo.CheckConstraint".
INSERT INTO CheckConstraint VALUES (2, 'Oh no!', 1);
SELECT * FROM CheckConstraint;
-- Id Name RecordStatus
-- ---- ------------ ------------
-- 1 No Problems 2
-- 1 No Problems 2
-- 1 No Problems 2
-- 1 No Problems 1
-- 2 Oh no! 1
-- 2 Oh no! 2
ALTER TABLE CheckConstraint
DROP CONSTRAINT CheckActiveCountConstraint;
DROP FUNCTION CheckActiveCount;
DROP TABLE CheckConstraint;
You could move the deleted records to a table that lacks the constraint, and perhaps use a view with UNION of the two tables to preserve the appearance of a single table.
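A sketch of that layout, with invented table and view names (the active table carries the uniqueness, the deleted table does not):
CREATE TABLE ActiveRecords (ID INT NOT NULL, Name VARCHAR(50), CONSTRAINT UQ_ActiveRecords_ID UNIQUE (ID));
CREATE TABLE DeletedRecords (ID INT NOT NULL, Name VARCHAR(50));
GO
CREATE VIEW AllRecords AS
SELECT ID, Name, 1 AS RecordStatus FROM ActiveRecords
UNION ALL
SELECT ID, Name, 2 AS RecordStatus FROM DeletedRecords;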
You can do this in a really hacky way...
Create a schema-bound view on your table.
CREATE VIEW dbo.Whatever
WITH SCHEMABINDING
AS
SELECT ID, Name, RecordStatus
FROM dbo.[Table]
WHERE RecordStatus = 1
Now create a unique clustered index on the view with the fields you want (a view can't carry a unique constraint, but a unique clustered index on an indexed view enforces the same thing).
One note about schema-bound views, though: if you change the underlying tables, you will have to recreate the view. Plenty of gotchas because of that.
For those still searching for a solution, I came across a nice answer to a similar question that I think can still be useful for many. While moving deleted records to another table may be a better solution, those who don't want to move the records can use the idea from the linked answer, which is as follows (a small sketch follows the list):
Set deleted = 0 when the record is available/active.
Set deleted = <row_id or some other unique value> when marking the row as deleted.
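A minimal sketch of that idea (table and column names are invented for illustration): because every active row has deleted = 0, a plain unique constraint on (ID, Deleted) only ever collides for active rows.
CREATE TABLE MyTable (
RowId INT IDENTITY(1,1) PRIMARY KEY,
ID INT NOT NULL,
Name VARCHAR(50),
Deleted INT NOT NULL DEFAULT 0, -- 0 = active; set to the row's own RowId when deleted
CONSTRAINT UQ_MyTable_ID_Deleted UNIQUE (ID, Deleted)
);
-- marking a row deleted frees the (ID, 0) slot for a new active row
UPDATE MyTable SET Deleted = RowId WHERE RowId = 42; -- 42 is just an example row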
If you can't use NULL as a RecordStatus, as Bill suggested, you could combine his idea with a function-based index. Create a function that returns NULL if the RecordStatus is not one of the values you want to consider in your constraint (and the RecordStatus otherwise), and create an index over that.
That'll have the advantage that you don't have to explicitly examine other rows in the table in your constraint, which could cause you performance issues.
I should say I don't know SQL server at all, but I have successfully used this approach in Oracle.
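A sketch of that idea in Oracle syntax (purely illustrative; the expression is NULL for rows that should be ignored, and index entries whose whole key is NULL are not stored, so uniqueness is enforced only for active rows):
CREATE UNIQUE INDEX active_id_uq
ON MyTable (CASE WHEN RecordStatus = 1 THEN ID END);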
Because you are going to allow duplicates, a unique constraint will not work. You can create a check constraint for the RecordStatus column and a stored procedure for INSERT that checks for existing active records before inserting duplicate IDs.
