Primary Key Validation in Snowflake Data Warehouse

1. Is there primary key validation in Snowflake?
2. If not, how are inserts handled on the Snowflake side (do we simply end up with duplicate rows)?
3. If the answer to 2 is yes, how are subsequent deletes and updates handled against these possible duplicate rows?

1) Snowflake supports defining and maintaining constraints, but does not enforce them, except for NOT NULL constraints, which are always enforced.
https://docs.snowflake.net/manuals/sql-reference/constraints-overview.html
2) Yes, they end up with duplicate rows.
3) You can use window functions to control which of the duplicate rows is kept or deleted (see the sketch below):
https://support.snowflake.net/s/question/0D50Z00008EJgemSAD/how-to-delete-duplicate-records-
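For illustration, a minimal sketch of that approach (the customers table and its columns are made up): rebuild the table from a windowed SELECT, keeping one row per key.

    -- Keep the most recent row per id; all other duplicates are dropped.
    -- Note: CREATE OR REPLACE drops grants on the old table unless COPY GRANTS is added.
    CREATE OR REPLACE TABLE customers AS
    SELECT id, name, updated_at
    FROM customers
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1;

A DELETE driven by the same ROW_NUMBER() logic also works; the rebuild is simply the shortest way to express it in Snowflake.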

Unistore (currently in private preview) is Snowflake's OLTP offering, and handles primary key validation automatically (like any traditional OLTP database).
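If that preview is available to you, a hedged sketch of what enforcement looks like there (hybrid tables; the orders table is made up, and syntax or availability may differ while the feature is in preview):

    CREATE HYBRID TABLE orders (
      order_id INT PRIMARY KEY,   -- enforced on a hybrid table, unlike on a standard table
      amount   NUMBER(10,2)
    );

    INSERT INTO orders VALUES (1, 10.00);
    INSERT INTO orders VALUES (1, 20.00);  -- rejected with a duplicate-key error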

Related

How to make constraints work in Snowflake?

Is there a way for constraints to actually work in Snowflake?
A primary key is created, yet duplicates can still be inserted into the table. Options like ON UPDATE CASCADE and ON DELETE CASCADE on a foreign key are not working either.
Can someone please help?
If you read the Snowflake documentation you will see that only NOT NULL constraints are enforced; all other constraint types are informational only.
I am guessing that the reason for this is that Snowflake is an analytical, rather than an OLTP, database and therefore the expectation is that constraints are enforced in your ELT processes (as is normal practice) rather than in the DB.
Snowflake does not enforce constraints except NOT NULL.
Snowflake Notes: I think we cannot enforce a constraint in a Snowflake database, but you can apply the constraint in your ETL tool (if you are using one).
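To see the behaviour for yourself, here is a small sketch (the orders table is hypothetical): the declared primary key is accepted but not enforced, while NOT NULL is.

    CREATE OR REPLACE TABLE orders (
      order_id INT NOT NULL PRIMARY KEY,
      amount   NUMBER(10,2)
    );

    INSERT INTO orders VALUES (1, 10.00);
    INSERT INTO orders VALUES (1, 20.00);   -- succeeds: the PRIMARY KEY is informational only
    INSERT INTO orders VALUES (NULL, 5.00); -- fails: NOT NULL is enforced

    -- Duplicates therefore have to be caught by your own ELT checks, e.g.:
    SELECT order_id, COUNT(*) FROM orders GROUP BY order_id HAVING COUNT(*) > 1;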

Minimum number of candidate keys for a relation?

My question is: is it necessary for a relation/table in a database to have a candidate key, and hence a primary key? Is it possible to have a relation where a row cannot be uniquely identified by any combination of attributes?
If not, why? And if so, how does a DBMS make operations like search and delete efficient?
Relations always have distinct tuples which means that in a Relational DBMS a table always has at least one candidate key.
SQL is a different case. SQL tables are "tuple bags", not relations. SQL tables can have duplicate rows, which is one of SQL's biggest flaws. Despite the fact that SQL supports duplicate rows the language is ill-suited to cope with them. In the presence of duplicate rows the SQL standard UPDATE and DELETE for instance have no guaranteed way to reference individual rows without resorting to some complex cursor-based operations.
Consequent problems of duplicate rows are certain inefficiencies and complexities of SQL DBMSs and a lack of orthogonality in their features. SQL DBMS engines have to use internal structures and support special features as a prerequisite in order to deal with duplicate rows. Some DBMS vendors try to get around the difficulties by disabling certain features for tables that don't have keys.
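A small sketch of the problem in plain SQL (the payments table is made up): with no key of any kind, DELETE and UPDATE can only address all of the duplicates at once.

    CREATE TABLE payments (customer_id INT, amount DECIMAL(10,2));  -- no key, no unique constraint

    INSERT INTO payments VALUES (42, 100.00);
    INSERT INTO payments VALUES (42, 100.00);  -- an exact duplicate is accepted

    -- There is no standard way to address just one of the two copies;
    -- this statement deletes both rows:
    DELETE FROM payments WHERE customer_id = 42 AND amount = 100.00;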
A database does not require a primary key. A table is just an unordered set of rows. Without any indexes, the only mechanism for accessing rows in a table is a full table scan (or a full partition scan, if the table is partitioned). Such operations are only efficient for very small numbers of rows.
Tables are more useful when you can refer to particular rows. Often, the best primary keys are auto-incremented/identity keys, which are maintained by the database (see the sketch after this answer). In practice, all tables in a well-designed database are going to have primary keys. Here are three reasons:
Rows can be referred to by other tables.
Individual rows can be updated and deleted.
Individual rows can be selected efficiently and unambiguously.
Note: you can have indexes on a table without a primary key, and combinations of one or more columns can be made unique even if the combination is not a primary key. The primary key is itself backed by an index, so the reverse (a primary key without an index) is not possible. All rows in a table also have unique "row addresses"; whether or not these are available for queries depends on the database engine.
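For instance, a database-maintained surrogate key might look like this (a sketch with a made-up table, using the SQL-standard identity syntax supported by PostgreSQL and Oracle; SQL Server uses IDENTITY(1,1) and MySQL uses AUTO_INCREMENT):

    CREATE TABLE customers (
      customer_id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      email       VARCHAR(255) NOT NULL,
      UNIQUE (email)  -- a natural key can still be enforced alongside the surrogate key
    );

    INSERT INTO customers (email) VALUES ('a@example.com');  -- customer_id is assigned automatically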
Yes, this is possible.
Just note that some identifier does exist behind the scenes (example from SQL Server):
When a table is stored as a heap, individual rows are identified by reference to a row identifier (RID) consisting of the file number, data page number, and slot on the page.
How will operations be performed?
A table scan will be needed for almost any operation:
If a table is a heap and does not have any nonclustered indexes, then the entire table must be examined (a table scan) to find any row.
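A short SQL Server sketch of the same point (the events table is made up): without any index the only access path is a scan of the heap; a nonclustered index gives the optimizer a seek instead.

    CREATE TABLE dbo.events (
      event_time datetime2 NOT NULL,
      payload    nvarchar(200)
    );  -- a heap: no primary key, no clustered index

    -- SELECT payload FROM dbo.events WHERE event_time = '2024-01-01'
    -- has to scan the entire heap.

    CREATE NONCLUSTERED INDEX ix_events_event_time ON dbo.events (event_time);

    -- The same predicate can now be answered with an index seek plus RID lookups into the heap.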

Do databases use foreign keys transparently?

Do database engines utilize foreign keys transparently or a query should explicitly use them?
Based on my experience, there is no explicit notion of foreign keys on a table, except a constraint that maintains uniqueness of the key and the fact that the key (a single field or a group of fields) is a key, which makes searches efficient.
To clarify why this is important, here is an example: I have a middleware (ArcGIS in my case) for which I can control the back-end database (so I can create keys, indexes, etc.), and I usually work through the front end (a RESTful API here). The middleware itself is a black box and is supposed to provide effective tools to take advantage of the underlying DBMS's capabilities. So what I want to understand is: if I build foreign key constraints and issue queries that would naturally make use of those foreign keys, should I see performance improvements?
Is that generally the case, or do various engines handle it differently? (I am using PostgreSQL.)
Foreign keys aren't there to improve performance. They're there to enforce data integrity. They will decrease performance for inserts/updates/deletes, but they make no difference to queries.
Some DBMSs will automatically add an index to the foreign key field, which may be where the confusion is coming from. Postgres does not do this; you'll need to create the index yourself. (And yes, the database will use this index transparently.)
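A minimal PostgreSQL sketch of that (parents/children are made-up tables): the foreign key enforces integrity, and the index on the referencing column has to be created explicitly.

    CREATE TABLE parents (
      id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY   -- the PK creates a unique index automatically
    );

    CREATE TABLE children (
      id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      parent_id bigint NOT NULL REFERENCES parents (id)    -- integrity check only; no index is created here
    );

    -- Create the index yourself if you join on parent_id, or delete/update rows in parents:
    CREATE INDEX children_parent_id_idx ON children (parent_id);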
As far as I know, database engines need explicit queries to use foreign keys; you have to write some sort of join query to get data from related tables.
However, some data access frameworks hide the complexity of accessing data through foreign keys by providing a transparent way of accessing data from related tables, but I am not sure that provides much improvement in performance.
This completely depends on the database engine.
In PostgreSQL constraints won't cause performance improvements directly, only indexes will do that.
CREATE INDEX is a PostgreSQL language extension. There are no provisions for indexes in the SQL standard.
However, adding some constraints will automatically create an index on the affected column(s); for example, UNIQUE and PRIMARY KEY constraints create a btree index on the affected column(s).
The FOREIGN KEY constraint won't create indexes on the referencing column(s), but:
A foreign key must reference columns that either are a primary key or form a unique constraint. This means that the referenced columns always have an index (the one underlying the primary key or unique constraint); so checks on whether a referencing row has a match will be efficient. Since a DELETE of a row from the referenced table or an UPDATE of a referenced column will require a scan of the referencing table for rows matching the old value, it is often a good idea to index the referencing columns too. Because this is not always needed, and there are many choices available on how to index, declaration of a foreign key constraint does not automatically create an index on the referencing columns.

Race condition in database insert logic

I have a database server.
The application logic is that it will query to see whether a particular row exists and, if not, insert a new row. The query is done in Java with container-managed transactions.
So with 2 application servers running the same code, is it possible for both servers to check that the row doesn't exist, and for both to insert the row? (the insert will be successful due to another unique auto-number primary key column)
How do we ensure there is one and only one unique row for that data?
Thanks.
the insert will be successful due to another unique auto-number primary key column
Most DBMSes offer a way to create a "unique key" or "unique index" that enforces uniqueness of a given column (or set of columns) even if it's not the primary key. The second insert would then fail, just as if it had violated a primary-key constraint.
You haven't indicated what DBMS you're using, but most (all?) of the common ones have this feature; for example, PostgreSQL, MySQL, SQL Server, and Oracle all do.
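A sketch of that fix in PostgreSQL syntax (user_settings, user_id and theme are assumed names): the unique constraint turns the check-then-insert race into a conflict the database resolves, so at most one row per user_id can ever exist.

    ALTER TABLE user_settings
      ADD CONSTRAINT user_settings_user_id_key UNIQUE (user_id);

    -- Both application servers can attempt the insert concurrently; the "loser"
    -- either catches the unique-violation error, or (in PostgreSQL) skips it atomically:
    INSERT INTO user_settings (user_id, theme)
    VALUES (42, 'dark')
    ON CONFLICT (user_id) DO NOTHING;

MySQL (INSERT IGNORE or ON DUPLICATE KEY UPDATE) and SQL Server (catching the unique-violation error) have their own idioms for the same pattern.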

How do you determine if a database table relationship merits enforcing referential integrity?

I have an application where the majority of the database tables have a strong relationship to one other table. Currently I am enforcing referential integrity with foreign keys, but I'm wondering if this is really the best approach. Data in the primary table can be deleted from an admin interface by business users, which means having to do a cascading delete (or writing several delete statements), but I'm not sure I really want to remove all that other data at the same time. It could be a lot of data that *might* be useful at a later date (reporting, maybe?). However, the data in the secondary tables is basically useless to the application itself unless the relationship with the primary table exists.
Given the option, I always keep data around. And since you already have foreign keys in place, you have some built-in protection from integrity violations.
If what your users want is to "delete" a record, therefore hiding it from the application, consider the "virtual delete" strategy -- mark a record as inactive, instead of physically removing it from the database.
As for implementation, depending on your DB, add whatever equates to a boolean/bit column for your table. All rows get assigned true/1 ("active") by default; "deleted" rows are marked false/0.
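A sketch of that approach (PostgreSQL-style syntax; the orders table and column names are made up):

    ALTER TABLE orders ADD COLUMN is_active boolean NOT NULL DEFAULT TRUE;

    -- A "delete" from the admin interface becomes an update:
    UPDATE orders SET is_active = FALSE WHERE order_id = 123;

    -- A view keeps application queries unchanged while hiding inactive rows:
    CREATE VIEW active_orders AS
    SELECT * FROM orders WHERE is_active;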
You can use foreign keys and relationships to enforce referential integrity without having to use cascading deletes. I seldom use cascading deletes as I've always found it's often better to have the data and manage/archive it well than it is to delete it.
Just write your own delete logic to support your own business rules.
Logical deletions work excellently as well and I use them extensively.
You don't want to delete only some of the data - you'll likely end up with rogue data that you have no idea where it belonged in the first place. It's either all or nothing.
Soft delete, i.e. having a bit field on every row that determines whether the record is "deleted" or not, is the way to go. That way, you simply check whether the record has deleted == true in the API and hide it from the application.
You keep the data, but no one can retrieve it through the application.
I would say use foreign key constraints as a rule - this "safeguards" your DB design long-term, as well as data integrity itself. Constraints are there also to explicitly state a designer's decision.
I've seen constraints ditched on extremely large databases - that would be one reason not to use them, if you compare the performance and there is a significant foreign key overhead.
I'd use logical/soft delete. This basically means adding one more column (possibly bit column Deleted) to the table in question, which would mark a particular row as deleted.
That said, "deleted" data is just that: deleted. Thus it cannot logically be used in reporting and similar stuff. In order to overcome this, I'd also introduce Hidden column to hide certain rows retaining their logical meaning.
Never do physical deletes. You can add a BOOL flag IsDeleted to indicate the record is deleted. When you want to "Delete" a record, simply set the flag to True.
