ScyllaDB: Prevent duplicate entries on specific columns

I'm using ScyllaDB and I have a table with the following 5 columns:
K1 K2 V1 V2 V3
Where K1 is the partition key, K2 is the clustering key, V1..V3 are three columns representing 3 different values in the table.
I want to prevent duplicate values from being added to the table where K1, K2, V1 and V2 match an existing entry in the table. In other words, it should not be possible to add/store more than one row where ALL 4 of those columns match an existing row with the same values.
Is this possible to achieve with Scylla?
Thanks

The most reliable way to achieve that is to make all 4 of those columns part of the primary key of the table. Keys are naturally de-duplicated, or better said, a new write with an existing key value will simply overwrite the old row with that key.
You mention that the current schema is something like this (assuming text as type for simplicity):
CREATE TABLE ks.tbl (
K1 text,
K2 text,
V1 text,
V2 text,
V3 text,
PRIMARY KEY(K1, K2)
);
You can change your primary key to be like this: PRIMARY KEY(K1, K2, V1, V2), i.e. keep K1 as the partition key and make K2, V1 and V2 the clustering columns.
You will still be able to query based on just K1 and K2, as clustering restrictions allow for only a prefix of the clustering key to be specified.
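A minimal sketch of what that change and its effect look like, keeping the text types from the example above (the keyspace name and values are just placeholders):
CREATE TABLE ks.tbl (
    K1 text,
    K2 text,
    V1 text,
    V2 text,
    V3 text,
    PRIMARY KEY (K1, K2, V1, V2)  -- K1 is the partition key; K2, V1, V2 are clustering columns
);

-- Two writes with the same (K1, K2, V1, V2) collapse into a single row:
-- the second insert simply overwrites V3 of the first.
INSERT INTO ks.tbl (K1, K2, V1, V2, V3) VALUES ('a', 'b', 'x', 'y', 'first');
INSERT INTO ks.tbl (K1, K2, V1, V2, V3) VALUES ('a', 'b', 'x', 'y', 'second');

-- Querying by the old key columns still works, since K2 is a prefix of the clustering key:
SELECT * FROM ks.tbl WHERE K1 = 'a' AND K2 = 'b';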

Related

Cassandra data modeling: choosing partition key and composite key

I am trying data modeling with Cassandra and I am confused about what I should choose as my partition key and composite key. My table looks like below:
CREATE TABLE mykeyspace.mytable (
id UUID,
A text,
B text,
C text,
D text,
... other columns
PRIMARY KEY(id)
);
I have introduced an id column in my table and made it the primary key, so that querying by id is fast.
The problem I am facing is that the set of columns (A, B, C, D) uniquely identifies the data. Whenever I perform an insertion I want to prevent duplication, and searching by (A, B, C, D) might be expensive since it's not part of my primary key.
I am generating the id randomly. One approach I thought of was to hash the 4 columns; that would solve the duplication problem, but I am skeptical about how the data would be distributed if I start deriving the id from a hash.
Another approach I thought of was making (A, B, C, D) the clustering key, so that my primary key looks like ((id), A, B, C, D), and using the clustering key before insertion to prevent duplication. Here I am not sure how efficient the searches are with just the clustering key.
Which of the above approaches is more suitable for data modeling, or is there another approach?
If your primary concern is data integrity (no dupes), you really have no choice but to make (A, B, C, D) your primary key. As for which subset of those columns to choose as partitioning key, there are several considerations. One of them is that for better scalability you want approximately even distribution of your data among partitions. So if D can only have 2 values, one of them used in 99% of rows, don't make D a sole partitioning column. Another consideration is how you want to query the data. If you want to be able to query by subsets of columns -- for example, query by (A, B, C) and (B, C, D), then your partitioning key choice is limited to either B, or C, or (B, C).
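As a hedged sketch only: if the queries you need are by (A, B, C) and by (B, C, D), one way to apply the above is to pick (B, C) as the partition key (column types assumed to be text; the table name is reused from the question):
CREATE TABLE mykeyspace.mytable (
    A text,
    B text,
    C text,
    D text,
    -- other columns omitted
    PRIMARY KEY ((B, C), A, D)  -- (A, B, C, D) as a whole is unique, so duplicates collapse
);

-- Query by (A, B, C): full partition key (B, C) plus the first clustering column A
SELECT * FROM mykeyspace.mytable WHERE B = 'b1' AND C = 'c1' AND A = 'a1';

-- Query by (B, C, D) skips clustering column A, so in practice it needs a second
-- query table (or ALLOW FILTERING) keyed for that access pattern.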

Create Primary Key in SQL Server

I have one table in SQL Server that has duplicate IDs, but I cannot delete those duplicate records. Now the requirement is to create a primary key on the column that has the duplicate data. Is there any way to create the primary key without changing the data?
No, there is no way to add a PRIMARY KEY constraint to a column that already has duplicate values.
Creating and Modifying PRIMARY KEY Constraints:
When a PRIMARY KEY constraint is added to an existing column or columns in the table, the Database Engine examines the existing column data and metadata to make sure that the following rules for primary keys are met:
The columns cannot allow for null values.
There can be no duplicate values.
If a PRIMARY KEY constraint is added to a column that has duplicate values or allows for null values, the Database Engine returns an error and does not add the constraint.
In case the ID column is incremental, a possible workaround is to add a unique filtered index:
CREATE UNIQUE INDEX AK_MyUniqueIndex ON dbo.MyTable (ID)
WHERE ID > ... max value of existing ID here
This way, uniqueness will be applied only to newly added records.
I know this is old, but I had an idea that I wanted to share:
Step 1. Add a non-nullable int column with a default value of 0.
Optional step. Update that column to 1 for the existing records, so you are able to identify them afterwards.
Step 2. In all existing rows that have duplicates, update the new column with a standard ROW_NUMBER() over a combination of unique columns (or all columns).
Step 3. Define the primary key with your ID column first (so it is indexed first), then add the column from Step 1.
You are done, and you have a special column that helps identify the duplicates easily; new records will all be marked as 0. A better practice would be to add a character or number to all IDs if possible and standardize them (this approach helps to do that afterwards), or to use something like a per-year sequence, etc. A T-SQL sketch of these steps follows below.
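A rough T-SQL sketch of those steps; the table, column and constraint names are placeholders:
-- Step 1: add a non-nullable discriminator column with a default of 0
ALTER TABLE dbo.MyTable
    ADD DupNo int NOT NULL CONSTRAINT DF_MyTable_DupNo DEFAULT (0);

-- Step 2: number the existing rows per duplicated ID so (ID, DupNo) becomes unique
;WITH numbered AS (
    SELECT ID, DupNo,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY (SELECT NULL)) AS rn
    FROM dbo.MyTable
)
UPDATE numbered SET DupNo = rn;

-- Step 3: define the composite primary key with ID first
ALTER TABLE dbo.MyTable
    ADD CONSTRAINT PK_MyTable PRIMARY KEY (ID, DupNo);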

Difference between partition key, composite key and clustering key in Cassandra?

I have been reading articles around the net to understand the differences between the following key types. But it just seems hard for me to grasp. Examples will definitely help make understanding better.
primary key
partition key
composite key
clustering key
There is a lot of confusion around this; I will try to make it as simple as possible.
The primary key is a general concept to indicate one or more columns used to retrieve data from a Table.
The primary key may be SIMPLE and even declared inline:
create table stackoverflow_simple (
key text PRIMARY KEY,
data text
);
That means that it is made up of a single column.
But the primary key can also be COMPOSITE (aka COMPOUND), made up of multiple columns.
create table stackoverflow_composite (
key_part_one text,
key_part_two int,
data text,
PRIMARY KEY(key_part_one, key_part_two)
);
With a COMPOSITE primary key, the "first part" of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part of the key is the CLUSTERING KEY (in this example key_part_two).
Please note that both the partition key and the clustering key can be made up of multiple columns; here's how:
create table stackoverflow_multiple (
k_part_one text,
k_part_two int,
k_clust_one text,
k_clust_two int,
k_clust_three uuid,
data text,
PRIMARY KEY((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)
);
Behind these names ...
The Partition Key is responsible for data distribution across your nodes.
The Clustering Key is responsible for data sorting within the partition.
The Primary Key is equivalent to the Partition Key in a single-field-key table (i.e. Simple).
The Composite/Compound Key is just any multiple-column key.
Further usage information: DATASTAX DOCUMENTATION
Small usage and content examples
SIMPLE KEY:
insert into stackoverflow_simple (key, data) VALUES ('han', 'solo');
select * from stackoverflow_simple where key='han';
table content
key | data
----+------
han | solo
COMPOSITE/COMPOUND KEY can retrieve "wide rows" (i.e. you can query by just the partition key, even if you have clustering keys defined)
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 9, 'football player');
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 10, 'ex-football player');
select * from stackoverflow_composite where key_part_one = 'ronaldo';
table content
 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |            9 |    football player
      ronaldo |           10 | ex-football player
But you can query with all keys (both partition and clustering) ...
select * from stackoverflow_composite
where key_part_one = 'ronaldo' and key_part_two = 10;
query output
 key_part_one | key_part_two | data
--------------+--------------+--------------------
      ronaldo |           10 | ex-football player
Important note: the partition key is the minimum specifier needed to perform a query using a WHERE clause.
If you have a composite partition key, like the following:
e.g. PRIMARY KEY((col1, col2), col10, col4)
you can perform a query only by passing at least both col1 and col2, as these are the two columns that define the partition key. The "general" rule for queries is that you must pass at least all partition key columns, and then you can optionally add each clustering key in the order they're declared (see the CQL sketch after the list below).
So, the valid queries are (excluding secondary indexes):
col1 and col2
col1 and col2 and col10
col1 and col2 and col10 and col4
Invalid:
col1 and col2 and col4
anything that does not contain both col1 and col2
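A hedged CQL illustration of those rules against a hypothetical table using that key (the keyspace, table name and types are invented for the example):
CREATE TABLE ks.example (
    col1 text,
    col2 text,
    col10 text,
    col4 text,
    data text,
    PRIMARY KEY ((col1, col2), col10, col4)
);

-- Valid: the full partition key, then clustering columns in declared order
SELECT * FROM ks.example WHERE col1 = 'a' AND col2 = 'b';
SELECT * FROM ks.example WHERE col1 = 'a' AND col2 = 'b' AND col10 = 'c';
SELECT * FROM ks.example WHERE col1 = 'a' AND col2 = 'b' AND col10 = 'c' AND col4 = 'd';

-- Invalid: col4 is restricted while col10 is skipped
-- SELECT * FROM ks.example WHERE col1 = 'a' AND col2 = 'b' AND col4 = 'd';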
Adding a summary answer as the accepted one is quite long. The terms "row" and "column" are used in the context of CQL, not how Cassandra is actually implemented.
A primary key uniquely identifies a row.
A composite key is a key formed from multiple columns.
A partition key is the primary lookup to find a set of rows, i.e. a partition.
A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).
Examples:
PRIMARY KEY (a): The partition key is a.
PRIMARY KEY (a, b): The partition key is a, the clustering key is b.
PRIMARY KEY ((a, b)): The composite partition key is (a, b).
PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).
PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.
PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).
In Cassandra, the difference between primary key, partition key, composite key and clustering key always causes some confusion, so I am going to explain them below and relate them to each other. We use CQL (Cassandra Query Language) for Cassandra database access.
Note: this answer is as per the updated version of Cassandra.
Primary Key:
In Cassandra there are two different ways to declare a primary key:
CREATE TABLE Cass (
id int PRIMARY KEY,
name text
);
Create Table Cass (
id int,
name text,
PRIMARY KEY(id)
);
In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key, which has the property that all rows sharing the same partition key (even across tables, in fact) are stored on the same physical node. Also, insertions/updates/deletions on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, using an extra set of parentheses to define which columns form the partition key.
Partitioning and Clustering
The PRIMARY KEY definition is made up of two parts: the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row.
CREATE TABLE device_check (
device_id int,
checked_at timestamp,
is_power boolean,
is_locked boolean,
PRIMARY KEY (device_id, checked_at)
);
Here device_id is the partition key and checked_at is the clustering key.
We can have multiple clustering columns, as well as multiple partition key columns, depending on the declaration.
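For instance, two queries the device_check table above supports (the literal values are just placeholders):
-- All checks for one device, ordered by checked_at within the partition
SELECT * FROM device_check WHERE device_id = 42;

-- A time-range slice inside the partition, using the clustering column
SELECT * FROM device_check
WHERE device_id = 42
  AND checked_at >= '2024-01-01' AND checked_at < '2024-02-01';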
Primary Key: Is composed of the partition key column(s) and optional clustering keys (or columns).
Partition Key: The hash value of the partition key is used to determine the specific node in the cluster that stores the data.
Clustering Key: Is used to sort the data within each partition (i.e. on the responsible node and its replicas).
Compound Primary Key: As said above, the clustering keys are optional in a primary key. If they aren't mentioned, it's a simple primary key. If clustering keys are mentioned, it's a compound primary key.
Composite Partition Key: Using just one column as a partition key might result in wide-row issues (depending on the use case/data modeling). Hence the partition key is sometimes specified as a combination of more than one column.
Regarding the confusion about which parts are mandatory and which can be skipped in a query, it helps to imagine Cassandra as a giant HashMap. In a HashMap, you can't retrieve the values without the key.
Here, the partition key plays the role of that key, so each query needs to specify it. Without it, Cassandra wouldn't know which node to search.
The clustering keys (columns, which are optional) help in further narrowing your query search after Cassandra finds out the specific node (and its replicas) responsible for that specific partition key.
In a brief sense:
Partition Key is nothing but an identification for a row; that identification is most of the time a single column (called the Primary Key) and sometimes a combination of multiple columns (called a Composite Partition Key).
Cluster key is nothing but indexing and sorting. Cluster keys depend on a few things:
Which columns you use in the WHERE clause, apart from the primary key columns.
How you want to divide very large data sets for easy management. For example, with a million county population records, I might cluster the data by state, then by pincode, and so on.
It is worth noting that you will probably use these a lot more than the similar concepts in the relational world (composite keys).
Example: suppose you have to find the last N users who recently joined user group X. How would you do this efficiently, given that reads are predominant in this case? Like this (from the official Cassandra guide):
CREATE TABLE group_join_dates (
groupname text,
joined timeuuid,
join_date text,
username text,
email text,
age int,
PRIMARY KEY ((groupname, join_date), joined)
) WITH CLUSTERING ORDER BY (joined DESC);
Here, the partitioning key is compound itself and the clustering key is the joined time. The reason the clustering key is the join time is that results are then already sorted (and stored that way, which makes lookups fast). But why do we use a compound key for the partitioning key? Because we always want to read as few partitions as possible. How does putting join_date in there help? Now users from the same group with the same join date will reside in a single partition! This means we will always read as few partitions as possible (first start with the newest, then move to older ones and so on, rather than jumping between them).
In fact, in extreme cases you would also need to use a hash of the join_date rather than the join_date alone, so that if you often query for the last 3 days, those days share the same hash and are therefore available from the same partition!
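For example, a hedged read path against that table (the date literal is invented):
-- Last N members who joined group X on a given day: one partition, already sorted newest-first
SELECT username, email, age
FROM group_join_dates
WHERE groupname = 'X' AND join_date = '2024-01-15'
LIMIT 10;

-- For a wider time window, issue one such query per day (or per date bucket),
-- walking backwards from the newest day.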
Disclaimer: This answer is specific to DynamoDB; however, the concepts apply to Cassandra as well, since both are NoSQL databases.
When you create a table, in addition to the table name, you must specify the primary key of the table. The primary key uniquely identifies each item in the table, so that no two items can have the same key.
DynamoDB supports two different kinds of primary keys:
Partition key – A simple primary key, composed of one attribute known as the partition key.
DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
In a table that has only a partition key, no two items can have the same partition key value.
Partition key and sort key – Referred to as a composite primary key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
DynamoDB uses the partition key value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. All items with the same partition key value are stored together, in sorted order by sort key value.
In a table that has a partition key and a sort key, it's possible for two items to have the same partition key value. However, those two items must have different sort key values.
A composite primary key gives you additional flexibility when querying data. For example, if you provide only the value for Artist, DynamoDB retrieves all of the songs by that artist. To retrieve only a subset of songs by a particular artist, you can provide a value for Artist along with a range of values for SongTitle.
Note: The partition key of an item is also known as its hash attribute. The term hash attribute derives from the use of an internal hash function in DynamoDB that evenly distributes data items across partitions, based on their partition key values.
The sort key of an item is also known as its range attribute. The term range attribute derives from the way DynamoDB stores items with the same partition key physically close together, in sorted order by the sort key value.
Reference - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.PrimaryKey
Primary Key: Like in many databases, it is a unique key in a table; essentially it means that no two records in a table can have the same primary key. The database, in this case Cassandra, is designed to make sure that this condition holds in all situations. So if you try to write a record with PK1 as the primary key and there is already a record with the same key PK1, it will get overwritten; otherwise a new record will be created.
Partition Key: It is a construct of distributed databases (where the data of a single table is divided into multiple parts called partitions). Partitions are then distributed across nodes using a distribution strategy (usually a hash of the partition key) to get near-infinite scaling capability. That said, the partition key is the set of columns of a record that decides which partition this record will belong to; hence, the partition key decides the physical location of a record across the distributed cluster of nodes.
Clustering Key: The clustering key decides the order of records within a particular partition. So, if there are 10K records in a partition, the clustering key decides the sorted order in which those 10K records are physically stored.
Example:
Let's say you have a table in Cassandra to store the sales events of an e-commerce website.
[order_id, item_id, quantity, amount, payment_id, status, order_time, PRIMARY KEY ((order_id, item_id), order_time)] WITH CLUSTERING ORDER BY (order_time DESC);
So here,
Primary Key is ((order_id, item_id), order_time); it decides the uniqueness of a record in the table.
Partition Key is (order_id, item_id); the hash of this tuple decides the partition of the record and its location in the distributed cluster.
Clustering Key is order_time; within a particular partition, records are ordered by order_time in descending order. So if you do a LIMIT 1 CQL query for a particular partition, you will always get the record with the maximum timestamp.
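A CQL sketch of that sales-event table; the column types here are assumed purely for illustration:
CREATE TABLE sales_events (
    order_id uuid,
    item_id uuid,
    quantity int,
    amount decimal,
    payment_id uuid,
    status text,
    order_time timestamp,
    PRIMARY KEY ((order_id, item_id), order_time)
) WITH CLUSTERING ORDER BY (order_time DESC);

-- Latest event for one (order_id, item_id) partition; '?' are driver-side bind markers,
-- substitute literal uuids when running this in cqlsh.
SELECT * FROM sales_events WHERE order_id = ? AND item_id = ? LIMIT 1;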
Composite key is just a term to specify that the primary key of a table is not a single column, but multiple columns.
Primary key is a combination of partition and clustering key.

What's the difference between a Primary Key and Identity?

In a SQL Server db, what is the difference between a Primary Key and an Identity column? A column can be a primary key without being an identity. A column cannot, however, be an identity without being a primary key.
In addition to the differences, what does a PK and Identity column offer that just a PK column doesn't?
A column can definitely be an identity without being a PK.
An identity is simply an auto-increasing column.
A primary key is the unique column or columns that define the row.
These two are often used together, but there's no requirement that this be so.
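A small T-SQL sketch of that independence; the tables are hypothetical:
-- Identity without a primary key: values auto-fill, but nothing enforces uniqueness
CREATE TABLE dbo.EventLog (
    EventId  int IDENTITY(1,1) NOT NULL,
    LoggedAt datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    Message  nvarchar(400) NOT NULL
);

-- Primary key without an identity: uniqueness enforced on a natural key, no auto-numbering
CREATE TABLE dbo.CountryCode (
    IsoCode char(2)       NOT NULL,
    Name    nvarchar(100) NOT NULL,
    CONSTRAINT PK_CountryCode PRIMARY KEY (IsoCode)
);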
This answer is more about WHY to use an identity and a primary key than WHAT they are, since Joe has answered WHAT correctly above.
An identity is a value your SQL Server controls. Identity is a row function. It is sequential, either increasing or decreasing in value, at least in SQL Server. It should never be modified, and gaps in the values should be ignored. Identity values are very useful for linking table B to table A, since the value is never duplicated. The identity is not the best choice for a clustered index in every case. If a table contains audit data, the clustered index may be better created on the date of occurrence, as it answers the question "what happened between today and four days ago" with less work, because the records for those dates are sequential in the data pages.
A primary key makes the column or columns in a row unique. Primary key is a column function. Only one primary key may be defined on any table, but multiple unique indexes may be created, which simulates the primary key. Clustering the primary key is not always the correct choice. Consider a phone book: if the phone book is clustered by the primary key (phone number), the query to return the phone numbers on "First Street" will be very costly.
The general rules I follow for identity and primary key are:
Always use an identity column
Create the clustered index on the column or columns which are used in range lookups
Keep the clustered index narrow since the clustered index is added to the end of every other index
Create primary key and unique indexes to reject duplicate values
Narrow keys are better
Create an index for every column or columns used in joins
These are my GENERAL rules.
A primary key (also known as a candidate key) is any set of attributes that have the properties of uniqueness and minimality. That means the key column or columns are constrained to be unique. In other words the DBMS won't permit any two rows to have the same set of values for those attributes.
The IDENTITY property effectively creates an auto-incrementing default value for a column. That column does not have to be unique though, so an IDENTITY column isn't necessarily a key.
However, an IDENTITY column is typically intended to be used as a key and therefore it usually has a uniqueness constraint on it to ensure that duplicates are not permitted.
Major Difference between Primary and Identity Column
Primary Column:
Primary Key cannot have duplicate values.
It creates a clustered index for the Table.
It can be set for any column type.
We need to provide the primary key column value when inserting into the table.
Identity Column:
An identity column can have duplicate values.
It can only be set for integer-related column types such as int, bigint, smallint, tinyint or decimal (with a scale of 0).
There is no need to insert values into the identity column; they are generated automatically based on the seed and increment.
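To illustrate that first point with a hedged T-SQL example (names hypothetical): without a unique constraint, an identity column can end up holding duplicates, for instance via IDENTITY_INSERT or after a reseed:
CREATE TABLE dbo.Demo (
    Id  int IDENTITY(1,1) NOT NULL,  -- identity, but no PRIMARY KEY or UNIQUE constraint
    Val varchar(10) NOT NULL
);

INSERT INTO dbo.Demo (Val) VALUES ('a');           -- gets Id = 1

SET IDENTITY_INSERT dbo.Demo ON;
INSERT INTO dbo.Demo (Id, Val) VALUES (1, 'b');    -- duplicate Id = 1 is accepted
SET IDENTITY_INSERT dbo.Demo OFF;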
A key is unique to a row. It's a way of identifying a row. Rows may have none, one, or several keys. These keys may consist of one or more columns.
Keys are indexes with a unique constraint. This differentiates them from non-key indexes.
Any index with multiple columns is called a "composite index".
Traditionally, a primary key is viewed as the main key that uniquely identifies a row. There may only be one of these.
Depending on the table's design, one may have no primary key.
A primary key is just that - a "prime key". It's the main one that specifies the unique identity of a row. Depending on a table's design, this can be a misnomer and multiple keys express the uniqueness.
In SQL Server, a primary key may be clustered. This means the remaining columns are attached to this key at the leaf level of the index. In other words, once SQL Server has found the key, it has also found the row (to be clear, this is because of the clustered aspect).
An identity column is simply a method of generating a unique ID for a row.
These two are often used together, but this is not a requirement.
You can use IDENTITY not only with integers, but also with any numeric data type that has a scale of 0.
A primary key column could have a scale, but it's not required.
IDENTITY, combined with a PRIMARY KEY or UNIQUE constraint, lets you provide a simple unique row identifier.
A primary key emphasizes uniqueness and prevents duplicate values across all records in the same column (or columns), while an identity provides auto-increasing numbers in a column without you having to supply the data.
Both features can be on a single column or on different ones.

Unique index or unique key?

What is the difference between a unique index and a unique key?
The unique piece is not where the difference lies. The index and key are not the same thing, and are not comparable.
A key is a data column, or several columns, that are forced to be unique with a constraint, either a primary key or an explicitly defined unique constraint, whereas an index is a structure that stores data locations for faster retrieval.
From the docs:
Unique Index
Creates a unique index on a table or view. A unique index is one in which no two rows are permitted to have the same index key value. A clustered index on a view must be unique.
Unique Key (Constraint)
You can use UNIQUE constraints to make sure that no duplicate values are entered in specific columns that do not participate in a primary key. Although both a UNIQUE constraint and a PRIMARY KEY constraint enforce uniqueness, use a UNIQUE constraint instead of a PRIMARY KEY constraint when you want to enforce the uniqueness of a column, or combination of columns, that is not the primary key.
This MSDN article comparing the two is what you're after. The terminology is such that "constraint" is ANSI, but in SQL Server you can't disable a Unique Constraint...
For most purposes, there's no difference - the constraint is implemented as an index under the covers. The MSDN article backs this up--the difference is in the meta-data, for things like:
tweaking FILLFACTOR
INCLUDE provides more efficient covering indexes (composite constraint)
A filtered index acts like a constraint over a subset of rows (e.g. to ignore multiple NULLs), etc. Both forms are sketched below.
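A rough T-SQL comparison of the two forms; the table and column names are placeholders:
-- Unique constraint: declared on the table, implemented as a unique index under the covers
ALTER TABLE dbo.Customer
    ADD CONSTRAINT UQ_Customer_Email UNIQUE (Email);

-- Unique index: the same uniqueness, plus index-only options such as INCLUDE,
-- a filter predicate and FILLFACTOR
CREATE UNIQUE INDEX UX_Customer_Email
    ON dbo.Customer (Email)
    INCLUDE (FirstName, LastName)
    WHERE Email IS NOT NULL
    WITH (FILLFACTOR = 90);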
"Unique key" is a tautology. A Key (AKA "Candidate Key") is logical feature of the database - a constraint that enforces the uniqueness of a set of attributes in a table.
An index is a physical level feature intended to optimise performance in some way. There are many types of index.
Unique Key: It is a constraint which imposes a limitation on the database; that limitation is that it will not allow duplicate values. For example, if you want to select one column as the primary key, it should be NOT NULL and UNIQUE.
Unique Index: It is an index which improves performance while executing queries on your database. A unique index also does not allow duplicate values in the index, i.e. no two rows will have the same index key value.
Here are few key differences:
Purpose:
Unique Key: Ensures integrity of data at the table level, so that no duplicates can be entered in the table. It is not used for query planning and does not contribute to query speed. (Its purpose is different from that of a primary key: the primary key uniquely identifies each record for data operations such as update/delete etc. In complex tables, a unique key can be a combination of several columns, and it would be inefficient to use the unique key for identifying records in transactions. Hence the primary key is a quick way of identifying a particular record in the table, while the unique key guarantees that no two records have the same key attributes.)
Unique Index: Ensures uniqueness of data at the index level; it cannot guarantee uniqueness at the table level, e.g. in the case of a filtered index. It is used for query planning and fetching data, and thus speeds up queries depending on the columns used/queried.
Filter Option:
Unique Key: Filter option is not available
Unique Index: Filter option is available
Storage Option:
Unique Key: Filegroup only
Unique Index: Filegroup or partition
Icon:
Unique Key: a vertical key icon
Unique Index: a b-tree icon
Both the key and the index are identifiers of a table row.
An index, though, is a parallel identification structure containing a pointer to the identified row, while keys are in-situ field members.
A key, as an identifier, implies a uniqueness constraint and a NOT NULL constraint.
There is no sense in NULL as an identifier (as null cannot identify anything), nor in a non-unique identifying value.
A non-clustered index can contain real data rather than serving as an identifier to real data, and so can be non-unique [1].
It is an unfortunate practice to refer to the key or index (an identifier) by the constraint (a rule or restriction), which is what most previous answers here have done.
Keys are used in context of:
alternate aka secondary aka candidate keys, can be multiple
composite key (a few fields combined)
primary key (superkey), natural or surrogate key, only one, really used for referential integrity
foreign key
A foreign key is really the key of another table (where it is the primary key), and in the referencing table it is often not even a key. Such usage is explained by the confusing shortening of the term "foreign key constraint" to just "foreign key".
A primary key constraint really implies NOT NULL and UNIQUE constraints plus the fact that the referenced column (or combination of columns) is an identifier. It, too, is unfortunately abbreviated to "primary key" or "primary key constraint", although it is both a key and a constraint and cannot properly be called only a (primary key) constraint or only a (primary) key.
Update:
My related question:
[1]
UNIQUE argument for INDEX creation - what's for?
The functionalities are more or less the same; it depends on your use case.
Suppose you want to prevent duplicate rows based on CUSTOMER_ID and TEAM_NAME.
In that case you can use both:
UNIQUE INDEX idx_customer_id_name (CUSTOMER_ID,TEAM_NAME)
UNIQUE KEY unique_key_customer_id_name (CUSTOMER_ID,TEAM_NAME)
But you should consider how often you fetch records based on CUSTOMER_ID and TEAM_NAME. If that is frequent, you should use a unique index, as it helps with faster retrieval of records; otherwise you should go with a unique key, as it avoids the overhead of fetching based on the index.
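For completeness, a hedged sketch of how those two fragments could appear in full statements (MySQL-style syntax, matching the fragments above; the table and extra columns are invented):
-- Option A: unique key declared inside the table definition
CREATE TABLE team_membership (
    CUSTOMER_ID int          NOT NULL,
    TEAM_NAME   varchar(100) NOT NULL,
    joined_at   datetime,
    UNIQUE KEY unique_key_customer_id_name (CUSTOMER_ID, TEAM_NAME)
);

-- Option B: a separate unique index created on the same pair of columns
CREATE UNIQUE INDEX idx_customer_id_name ON team_membership (CUSTOMER_ID, TEAM_NAME);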
