I am trying out data modeling with Cassandra and I am confused about what I should choose as my partition key and composite key. My table looks like this:
CREATE TABLE mykeyspace.mytable (
id UUID,
A text,
B text,
C text,
D text,
... other columns
PRIMARY KEY(id)
);
I introduced an id column in my table and made it the primary key so that querying by id is fast.
The problem I am facing is that the set of columns (A, B, C, D) uniquely identifies the data. Whenever I insert, I want to prevent duplicates, and searching by (A, B, C, D) might be expensive since those columns are not part of my primary key.
I am generating the id randomly. One approach I thought of was to hash the four columns to produce the id. That would solve the duplication problem, but I am skeptical about how the data would be distributed if I use a hash as the id.
The other approach I thought of was making (A, B, C, D) clustering columns, so that my primary key looks like ((id), A, B, C, D), and querying by the clustering columns before each insert to prevent duplicates. Here I am not sure how efficient searches are with just the clustering key.
Which of these approaches is more suitable, or is there another approach?
If your primary concern is data integrity (no dupes), you really have no choice but to make (A, B, C, D) your primary key. As for which subset of those columns to choose as partitioning key, there are several considerations. One of them is that for better scalability you want approximately even distribution of your data among partitions. So if D can only have 2 values, one of them used in 99% of rows, don't make D a sole partitioning column. Another consideration is how you want to query the data. If you want to be able to query by subsets of columns -- for example, query by (A, B, C) and (B, C, D), then your partitioning key choice is limited to either B, or C, or (B, C).
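To make the last point concrete, here is a minimal Python sketch (not Cassandra code) of the reasoning: since every query must supply the full partition key, the partition key has to be a non-empty subset of the columns common to all query patterns. The function name partition_key_candidates is hypothetical and only used for illustration.

```python
from itertools import combinations


def partition_key_candidates(query_patterns):
    """Return every non-empty column subset that appears in all query patterns."""
    # Columns that every query pattern restricts.
    common = set.intersection(*(set(pattern) for pattern in query_patterns))
    ordered = sorted(common)
    # Any non-empty subset of the common columns can serve as the partition key.
    return [
        combo
        for size in range(1, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


candidates = partition_key_candidates([("A", "B", "C"), ("B", "C", "D")])
print(candidates)  # [('B',), ('C',), ('B', 'C')]
```

This only narrows down the candidates; the distribution concern above (avoiding a low-cardinality or heavily skewed column) still has to be checked against your actual data.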
I have two tables with a many-to-many relationship, so I use a third table to map them together. I don't know which to choose: an auto-incremented integer primary key with a unique index on the two foreign keys, or a composite primary key made of the two foreign keys. What are the benefits of each approach?
Thanks a lot.
In data modeling theory, both solutions are correct.
In practice, though, it is usually better to add a new auto-increment ID, make it the primary key, and put a unique constraint on the two migrated foreign keys (together).
Advantages:
Less redundant data:
Assume that we have two tables named A and B, and that AB is the newly created junction table (because of the many-to-many relationship).
Now suppose AB gets a new one-to-many relationship with C. We have to migrate the primary key of AB into C as a foreign key, so it is better to migrate a single ID instead of two attributes. The same applies if C gets a one-to-many relationship with D, and so on.
Disadvantages:
Access to the IDs of A and B: Although the composite-key solution is redundant, it lets you read those IDs without any JOINs (for example, when table C needs the IDs of A and B). In practice, though, accessing only the IDs is not very common.
In a Microsoft SQL Server database there is no significant difference between a PRIMARY KEY constraint and a UNIQUE constraint on non-nullable columns. The PRIMARY KEY is essentially just syntactic sugar. Convention and individual preference are the most common reasons for using the PRIMARY KEY constraint.
In any DBMS what really matters is what keys you have and how you use them, not which keys you designate as "primary".
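As a rough illustration of the surrogate-key variant, here is a small sketch using Python's built-in sqlite3 module. SQLite is only a stand-in so the example is runnable anywhere (the thread is not about SQLite), and the table names a, b and ab are placeholders. The point is that the UNIQUE constraint on the pair of foreign keys rejects duplicates even though the pair is not the primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY);
    -- Junction table: surrogate primary key plus a unique constraint on the pair.
    CREATE TABLE ab (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,
        a_id INTEGER NOT NULL REFERENCES a(id),
        b_id INTEGER NOT NULL REFERENCES b(id),
        UNIQUE (a_id, b_id)
    );
    INSERT INTO a (id) VALUES (1);
    INSERT INTO b (id) VALUES (1);
    INSERT INTO ab (a_id, b_id) VALUES (1, 1);
""")

try:
    # Same (a_id, b_id) pair again: the unique constraint should reject it.
    conn.execute("INSERT INTO ab (a_id, b_id) VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM ab").fetchone()[0]
print(duplicate_rejected, row_count)  # True 1
```

A child table C would then reference ab(id) with a single-column foreign key instead of carrying both a_id and b_id.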
In my system I have temporary entities that are created based on rules stored in my database, and the entities are not persisted.
Now I need to store information about these entities, and because they are created from rules and are not stored, they have no ID.
I came up with a formula to generate an ID for these temp entities based on the rule that was used to generate them: id = rule id + "-" + entity index in the rule. This formula generates unique strings of the form 164-3, 123-0, 432-2, etc...
My question is how should I build my table (regarding primary key and clustered index) when my keys have no relation or order? Keep in mind that I will only (99.9% of the time) query the table using the id mentioned above.
Options I thought about after much reading, but don't have the knowledge to determine which is better:
1) primary key on a varchar column with clustered index. -According to various sources, this would be bad because of fragmentation and the width of the key. Also, the format of the values is pretty awkward for sorting.
2) primary key on varchar column without clustered index (heap table). -Also a bad idea according to various sources due to indexing and fragmentation issues.
3) identity int column with clustered index, and a varchar column as primary key with unique index. -I can't really see the benefit of the surrogate key here, since it would mainly help with range queries and ordering, and I would never query the table by this key because it would be unknown at all times.
4) 2 columns composite key: rule id + rule index columns. -Now I don't have strings, but I have two columns that will be copied to FKs and nonclustered indexes. Also, I'm not sure what indexes I would use in this case.
Can anybody shine a light here? Any help is appreciated.
--Edit
I will perform more selects than inserts;
I will perform more inserts than updates;
All selects will include at least rule id;
If I use a surrogate primary key and a unique index on (rule id, index), then I can use the surrogate for subsequent operations after retrieving data by rule id, which would be faster. Also, inserts would be faster.
However, because the data will be stored in surrogate key order, I might have records with the same rule id but a different index stored quite far from each other on disk, which means that even with an index on rule id, retrieving the data could be fairly slow.
If I use (rule id, index) as the clustered primary key, rows with the same rule id would be stored close to each other, and selecting data by rule id would be efficient enough. However, I suspect inserts would be slow.
Is the rationale above correct?
Using a heap is generally a bad idea unless proven otherwise. You need a very solid reason for not having a clustered index (almost any clustered index will make things better, even one on an identity column).
Storing this key in a single column is okay; if you want natural sorting, you can pad your numbers with zeroes, for example. However, this will widen the key.
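To illustrate the padding idea, here is a short Python sketch. The helper padded_key and the width of 6 digits are assumptions made for the example, not something from the question.

```python
def padded_key(rule_id: int, entity_index: int, width: int = 6) -> str:
    """Build 'rule-index' with both numbers left-padded with zeroes."""
    return f"{rule_id:0{width}d}-{entity_index:0{width}d}"


raw_keys = ["164-3", "1000-2", "123-10", "123-2"]

# Plain string sorting compares character by character, so 1000 sorts before 123.
raw_sorted = sorted(raw_keys)

# After padding, string order matches numeric order of (rule id, index).
padded_sorted = sorted(
    padded_key(*(int(part) for part in key.split("-"))) for key in raw_keys
)

print(raw_sorted)     # ['1000-2', '123-10', '123-2', '164-3']
print(padded_sorted)  # ['000123-000002', '000123-000010', '000164-000003', '001000-000002']
```

Note how the padded key is 13 characters instead of 5 or 6, which is the widening mentioned above.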
Having a composite primary key (and, subsequently, foreign keys) is completely acceptable, especially when dealing with natural keys, like the one you have. This will give you the narrowest possible key - int + int or some such - while eliminating the sorting issue at the same time. I would recommend to make this PK clustered to reduce additional key lookups.
Fragmentation here will not be a big issue; at least, no bigger than with any other indexing decision. Any index built on such a key will be prone to fragmentation, clustered or no. In any case, your DBA should know how to keep an index such as this in top form.
Regarding the order of columns in the index, the following rules usually apply:
1) If partial key matches will take place (filtering by one part of the key but not the other), the column that is used most often should go first;
2) If rule 1 isn't applicable and all parts of the key are used in all queries, the column with the highest cardinality should go first.
The order of the remaining columns (if there is more than one) isn't of much importance, because SQL Server only builds the statistics histogram on the first column of a composite index. However, it is a good idea to list them in order of decreasing cardinality.
EDIT: Seeing your update with additional details, here are the most suitable options. Suppose your table looks like this:
-- Sample table
create table dbo.TempEntities (
RuleId int not null,
IndexId int not null,
-- Remaining columns listed here
EntityData xml not null
);
go
From here, the most straightforward way is to use the natural key as a clustered index:
-- Option 1 - natural clustered index
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (RuleId, IndexId);
go
However, if you have any child tables that would reference this one, it might not be the most convenient solution, because natural keys are prone to updates, which creates a mess where you could avoid it. Instead, a surrogate key can be introduced, like this:
-- Option 2 - surrogate clustered, natural nonclustered
alter table dbo.TempEntities add Id bigint identity(1,1) not null;
alter table dbo.TempEntities
add constraint PK_TempEntities primary key clustered (Id);
alter table dbo.TempEntities
add constraint UQ_TempEntities_RuleIdIndexId unique (RuleId, IndexId);
go
It makes sense to have the surrogate PK clustered, because it will result in far fewer page splits, making inserts faster (despite having one more index than Option 1). Without any intimate knowledge of your queries, this is probably the most balanced solution.
Shuffling the clustered attribute between surrogate and natural keys has mostly academic value and can only make a difference on a high-load system with hundreds of inserts happening every second on a 24/7 schedule. If your system is indeed like that, please seek a professional consultant who will analyse your queries and provide a solution tailored to your situation.
I'm not asking HOW to do this, but if it's what I SHOULD be doing.
Two employees can be working on the same job. So of course, both FKs, EmployeeID and JobID, can have a MANY relationship in an "Employee_Jobs" table.
Let's take Employee A, Employee B, Job A and Job B. All of the following would be acceptable:
A A
A B
B A
B B
What would NOT be acceptable is a duplicate of a combination of these two PKs... since we cannot have for example, [Employee A working on Job A] twice.
So would it be correct to say that the only way to manage this is to make the combination of the two PKs, EmployeeID and JobID, a Unique, non-clustered index?
I tried to think of how to break this up into more tables instead, but I keep coming back to this same problem.
Yes, not only is it appropriate; the combination of these two attributes should in fact be the PRIMARY KEY.
In any other table whose rows have a logical attribute representing the work done by an employee on a job (or the employee's contribution to a job, or any other association between an employee and a job), the FK should be a composite foreign key consisting of these same two columns, EmployeeID and JobID.
If you are using a surrogate key on this table to simplify joins and the definition of foreign keys in other tables, then by all means continue to do so, but keep the two-column natural key in this table as either a unique index or an alternate key (a key is a key: anything that is declared or defined to be unique), so as to ensure data integrity in this table. In fact, to make it clear to users of the schema, when this situation comes up I generally make the composite natural key the PRIMARY KEY and define the surrogate (which is used in joins and in other tables' FKs) as an alternate key or unique index. This is pretty much only a semantic distinction, since the two create almost identical functionality. But because data integrity is more important to me than join syntax and foreign key structure, to me the natural key is the PRIMARY KEY.
Yes, in that case you should consider making both fields the primary key, specifically a composite (compound) primary key like the one below, which guarantees that each combination of the two fields is unique.
primary key (EmployeeID , JobID)
You mentioned a unique, non-clustered index; note that declaring both fields as the primary key will, by default in SQL Server, actually create a unique clustered index on them.
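Here is a quick way to try the composite key out, sketched with Python's built-in sqlite3 module. SQLite is used only because it runs anywhere; it is an assumption of this example and the clustered/non-clustered distinction does not carry over to it. The four combinations from the question are accepted and a repeated pair is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee_Jobs (
        EmployeeID TEXT NOT NULL,
        JobID      TEXT NOT NULL,
        PRIMARY KEY (EmployeeID, JobID)
    )
""")

# All four combinations from the question are distinct pairs, so all are accepted.
pairs = [("A", "A"), ("A", "B"), ("B", "A"), ("B", "B")]
conn.executemany("INSERT INTO Employee_Jobs VALUES (?, ?)", pairs)

try:
    # [Employee A working on Job A] a second time violates the composite key.
    conn.execute("INSERT INTO Employee_Jobs VALUES ('A', 'A')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

stored = conn.execute("SELECT COUNT(*) FROM Employee_Jobs").fetchone()[0]
print(duplicate_rejected, stored)  # True 4
```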
I have been reading articles around the net to understand the differences between the following key types, but they are still hard for me to grasp. Examples would definitely help.
primary key
partition key
composite key
clustering key
There is a lot of confusion around this; I will try to make it as simple as possible.
The primary key is a general concept to indicate one or more columns used to retrieve data from a Table.
The primary key may be SIMPLE and even declared inline:
create table stackoverflow_simple (
key text PRIMARY KEY,
data text
);
That means that it is made of a single column.
But the primary key can also be COMPOSITE (aka COMPOUND), made of several columns.
create table stackoverflow_composite (
key_part_one text,
key_part_two int,
data text,
PRIMARY KEY(key_part_one, key_part_two)
);
With a COMPOSITE primary key, the first part of the key is called the PARTITION KEY (in this example key_part_one is the partition key) and the second part is the CLUSTERING KEY (in this example key_part_two).
Please note that both the partition key and the clustering key can be made of several columns. Here's how:
create table stackoverflow_multiple (
k_part_one text,
k_part_two int,
k_clust_one text,
k_clust_two int,
k_clust_three uuid,
data text,
PRIMARY KEY((k_part_one, k_part_two), k_clust_one, k_clust_two, k_clust_three)
);
Behind these names ...
The Partition Key is responsible for data distribution across your nodes.
The Clustering Key is responsible for data sorting within the partition.
The Primary Key is equivalent to the Partition Key in a single-field-key table (i.e. Simple).
The Composite/Compound Key is just any multiple-column key
Further usage information: DATASTAX DOCUMENTATION
Small usage and content examples
SIMPLE KEY:
insert into stackoverflow_simple (key, data) VALUES ('han', 'solo');
select * from stackoverflow_simple where key='han';
table content
 key | data
-----+------
 han | solo
COMPOSITE/COMPOUND KEY can retrieve "wide rows" (i.e. you can query by just the partition key, even if you have clustering keys defined)
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 9, 'football player');
insert into stackoverflow_composite (key_part_one, key_part_two, data) VALUES ('ronaldo', 10, 'ex-football player');
select * from stackoverflow_composite where key_part_one = 'ronaldo';
table content
 key_part_one | key_part_two | data
--------------+--------------+--------------------
 ronaldo      |            9 | football player
 ronaldo      |           10 | ex-football player
But you can query with all keys (both partition and clustering) ...
select * from stackoverflow_composite
where key_part_one = 'ronaldo' and key_part_two = 10;
query output
 key_part_one | key_part_two | data
--------------+--------------+--------------------
 ronaldo      |           10 | ex-football player
Important note: the partition key is the minimum-specifier needed to perform a query using a where clause.
If you have a composite partition key, like the following
eg: PRIMARY KEY((col1, col2), col10, col4)
you can query only by passing at least both col1 and col2, the two columns that define the partition key. The general rule for a query is that you must pass at least all partition key columns; then you can optionally add clustering key columns in the order in which they are declared.
So the valid queries are (excluding secondary indexes):
col1 and col2
col1 and col2 and col10
col1 and col2 and col10 and col4
Invalid:
col1 and col2 and col4
anything that does not contain both col1 and col2
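The rule can be written down as a small check. This is only a Python sketch of the rule as stated above, not how Cassandra validates queries; it considers equality restrictions only and ignores secondary indexes and ALLOW FILTERING. The function name is_valid_restriction is made up for the example.

```python
def is_valid_restriction(partition_key, clustering_columns, restricted):
    """Check whether a set of restricted columns follows the rule above."""
    restricted = set(restricted)
    # Every partition key column must be present.
    if not set(partition_key) <= restricted:
        return False
    remaining = restricted - set(partition_key)
    # Count how many leading clustering columns are restricted, in declared order.
    prefix_len = 0
    for column in clustering_columns:
        if column not in remaining:
            break
        prefix_len += 1
    # Whatever is left must be exactly that leading prefix, with no gaps.
    return remaining == set(clustering_columns[:prefix_len])


pk = ("col1", "col2")
ck = ("col10", "col4")

print(is_valid_restriction(pk, ck, ["col1", "col2"]))                   # True
print(is_valid_restriction(pk, ck, ["col1", "col2", "col10"]))          # True
print(is_valid_restriction(pk, ck, ["col1", "col2", "col10", "col4"]))  # True
print(is_valid_restriction(pk, ck, ["col1", "col2", "col4"]))           # False
print(is_valid_restriction(pk, ck, ["col1", "col10"]))                  # False
```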
Adding a summary answer as the accepted one is quite long. The terms "row" and "column" are used in the context of CQL, not how Cassandra is actually implemented.
A primary key uniquely identifies a row.
A composite key is a key formed from multiple columns.
A partition key is the primary lookup to find a set of rows, i.e. a partition.
A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).
Examples:
PRIMARY KEY (a): The partition key is a.
PRIMARY KEY (a, b): The partition key is a, the clustering key is b.
PRIMARY KEY ((a, b)): The composite partition key is (a, b).
PRIMARY KEY (a, b, c): The partition key is a, the composite clustering key is (b, c).
PRIMARY KEY ((a, b), c): The composite partition key is (a, b), the clustering key is c.
PRIMARY KEY ((a, b), c, d): The composite partition key is (a, b), the composite clustering key is (c, d).
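If it helps, the same reading rule can be expressed as a few lines of Python. This is a toy parser for the simple forms listed above (plain column names, at most one inner pair of parentheses), not a CQL parser, and split_primary_key is a name invented for this sketch.

```python
import re


def split_primary_key(definition):
    """Split 'PRIMARY KEY (...)' into (partition key columns, clustering columns)."""
    match = re.fullmatch(
        r"\s*PRIMARY\s+KEY\s*\((.*)\)\s*", definition, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError(f"not a PRIMARY KEY clause: {definition!r}")
    body = match.group(1).strip()
    if body.startswith("("):
        # Inner parentheses enclose a composite partition key.
        close = body.index(")")
        partition = [col.strip() for col in body[1:close].split(",")]
        rest = body[close + 1:]
    else:
        # Without inner parentheses, only the first column is the partition key.
        first, _, rest = body.partition(",")
        partition = [first.strip()]
    clustering = [col.strip() for col in rest.split(",") if col.strip()]
    return partition, clustering


print(split_primary_key("PRIMARY KEY (a, b, c)"))       # (['a'], ['b', 'c'])
print(split_primary_key("PRIMARY KEY ((a, b), c, d)"))  # (['a', 'b'], ['c', 'd'])
```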
In Cassandra, the difference between primary key, partition key, composite key, and clustering key always causes some confusion, so I am going to explain them below and relate them to each other. We use CQL (Cassandra Query Language) to access a Cassandra database.
Note: this answer applies to recent versions of Cassandra.
Primary Key:
In Cassandra there are two ways to declare a single-column primary key.
CREATE TABLE Cass (
id int PRIMARY KEY,
name text
);
Create Table Cass (
id int,
name text,
PRIMARY KEY(id)
);
In CQL, the order in which columns are defined for the PRIMARY KEY matters. The first column of the key is called the partition key and has the property that all rows sharing the same partition key (even across tables, in fact) are stored on the same physical node. Also, insertions, updates, and deletions on rows sharing the same partition key for a given table are performed atomically and in isolation. Note that it is possible to have a composite partition key, i.e. a partition key formed of multiple columns, by using an extra set of parentheses to define which columns form the partition key.
Partitioning and Clustering
The PRIMARY KEY definition is made up of two parts: the Partition Key and the Clustering Columns. The first part maps to the storage engine row key, while the second is used to group columns in a row.
CREATE TABLE device_check (
device_id int,
checked_at timestamp,
is_power boolean,
is_locked boolean,
PRIMARY KEY (device_id, checked_at)
);
Here device_id is the partition key and checked_at is the clustering key.
We can have multiple clustering columns as well as multiple partition key columns, depending on the declaration.
Primary Key: Is composed of partition key(s) [and optional clustering keys(or columns)]
Partition Key: The hash value of Partition key is used to determine the specific node in a cluster to store the data
Clustering Key: Is used to sort the data in each of the partitions (or responsible node and its replicas)
Compound Primary Key: As said above, the clustering keys are optional in a Primary Key. If they aren't mentioned, it's a simple primary key. If clustering keys are mentioned, it's a Compound primary key.
Composite Partition Key: Using just one column as a partition key, might result in wide row issues (depends on use case/data modeling). Hence the partition key is sometimes specified as a combination of more than one column.
As for the confusion about which parts are mandatory and which can be skipped in a query, it helps to imagine Cassandra as a giant HashMap. In a HashMap, you can't retrieve a value without its key.
Here, the partition key plays the role of that key, so each query needs to specify it. Without it, Cassandra wouldn't know which node to search.
The clustering keys (columns, which are optional) help narrow the search further after Cassandra has found the node (and its replicas) responsible for that partition key.
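To play with the HashMap picture, here is a toy model in Python: a dict keyed by partition key, where each value is a list of rows kept sorted by clustering key. It is only a mental model (the class ToyTable and its methods are invented for this sketch) and says nothing about how Cassandra stores data internally.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """Dict of partitions; each partition is a list sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)  # partition key -> [(clustering key, row)]

    def insert(self, partition_key, clustering_key, row):
        rows = self._partitions[partition_key]
        keys = [key for key, _ in rows]
        pos = bisect.bisect_left(keys, clustering_key)
        if pos < len(rows) and rows[pos][0] == clustering_key:
            rows[pos] = (clustering_key, row)  # same primary key: overwrite
        else:
            rows.insert(pos, (clustering_key, row))  # keep the partition sorted

    def select(self, partition_key, clustering_key=None):
        # The partition key is mandatory; the clustering key only narrows the result.
        rows = self._partitions.get(partition_key, [])
        if clustering_key is None:
            return [row for _, row in rows]
        return [row for key, row in rows if key == clustering_key]


table = ToyTable()
table.insert("ronaldo", 10, "ex-football player")
table.insert("ronaldo", 9, "football player")

print(table.select("ronaldo"))      # ['football player', 'ex-football player']
print(table.select("ronaldo", 10))  # ['ex-football player']
```

Notice that select has no way to look up rows by clustering key alone, which mirrors the rule that the partition key must always be given.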
In brief:
The partition key identifies where a row lives. Most of the time it is a single column (which is then also the primary key); sometimes it is a combination of multiple columns (called a composite partition key).
The clustering key provides indexing and sorting. The choice of clustering key depends on a few things:
Which columns you use in the WHERE clause besides the partition key columns.
If you have a very large number of records, how you want to divide the data for easier management. For example, if I have one million population records for a country, I can cluster the data by state, then by PIN code, and so on.
It is worth noting that you will probably use these a lot more than the similar concept in the relational world (composite keys).
Example: suppose you have to find the last N users who recently joined user group X. How would you do this efficiently, given that reads are predominant in this case? Like this (from the official Cassandra guide):
CREATE TABLE group_join_dates (
groupname text,
joined timeuuid,
join_date text,
username text,
email text,
age int,
PRIMARY KEY ((groupname, join_date), joined)
) WITH CLUSTERING ORDER BY (joined DESC);
Here, the partition key is itself composite and the clustering key is joined, the join timestamp. The reason for clustering on the join timestamp is that results are stored already sorted, which makes lookups fast. But why do we use a composite partition key? Because we always want to read as few partitions as possible. How does putting join_date in there help? Now users from the same group with the same join date reside in a single partition. This means we will always read as few partitions as possible (start with the newest, then move to older ones, rather than jumping between them).
In fact, in extreme cases you would also need to use a hash of join_date rather than join_date alone, so that if you often query for the last 3 days, those days share the same hash and are therefore available from the same partition.
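One way to read "hash of join_date" is as a bucketing function that maps several adjacent dates to the same value; that reading is my assumption, since an ordinary hash would scatter neighbouring dates. A minimal Python sketch, with a hypothetical join_date_bucket helper and a 3-day bucket size chosen only to match the example:

```python
from datetime import date


def join_date_bucket(day: date, days_per_bucket: int = 3) -> int:
    """Map a date to a bucket number; adjacent days usually share a bucket."""
    # toordinal() counts days since 0001-01-01, so integer division groups
    # consecutive days into fixed-size windows.
    return day.toordinal() // days_per_bucket


start = date.fromordinal(738000)  # a day whose ordinal is a multiple of 3
window = [date.fromordinal(738000 + offset) for offset in range(4)]
buckets = [join_date_bucket(day) for day in window]
print(buckets)  # [246000, 246000, 246000, 246001]
```

With fixed windows like this, an arbitrary "last 3 days" range touches at most two buckets rather than three separate daily partitions, so it reduces the number of partitions read but does not always bring it down to one.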
Disclaimer: This answer is specific to DynamoDB; however, the concepts apply to Cassandra as well, since both are NoSQL databases.
When you create a table, in addition to the table name, you must specify the primary key of the table. The primary key uniquely identifies each item in the table, so that no two items can have the same key.
DynamoDB supports two different kinds of primary keys:
Partition key – A simple primary key, composed of one attribute known as the partition key.
DynamoDB uses the partition key's value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored.
In a table that has only a partition key, no two items can have the same partition key value.
Partition key and sort key – Referred to as a composite primary key, this type of key is composed of two attributes. The first attribute is the partition key, and the second attribute is the sort key.
DynamoDB uses the partition key value as input to an internal hash function. The output from the hash function determines the partition (physical storage internal to DynamoDB) in which the item will be stored. All items with the same partition key value are stored together, in sorted order by sort key value.
In a table that has a partition key and a sort key, it's possible for two items to have the same partition key value. However, those two items must have different sort key values.
A composite primary key gives you additional flexibility when querying data. For example, if you provide only the value for Artist, DynamoDB retrieves all of the songs by that artist. To retrieve only a subset of songs by a particular artist, you can provide a value for Artist along with a range of values for SongTitle.
Note: The partition key of an item is also known as its hash attribute. The term hash attribute derives from the use of an internal hash function in DynamoDB that evenly distributes data items across partitions, based on their partition key values.
The sort key of an item is also known as its range attribute. The term range attribute derives from the way DynamoDB stores items with the same partition key physically close together, in sorted order by the sort key value.
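The idea of hashing the partition key to choose a partition can be sketched in a few lines of Python. SHA-256 and the partition count of 8 are stand-ins chosen for this example; DynamoDB's internal hash function is not documented here, and partition_for is a hypothetical helper.

```python
import hashlib


def partition_for(partition_key: str, partition_count: int) -> int:
    """Pick a partition number from a hash of the partition key value."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and fold it into the range.
    return int.from_bytes(digest[:8], "big") % partition_count


# Items with the same partition key value always land in the same partition,
# regardless of their sort key value.
first_song = partition_for("No One You Know", 8)
second_song = partition_for("No One You Know", 8)
print(first_song == second_song)  # True
```

Within that one partition, the items would then be kept in sort key order, which is what makes range queries on the sort key cheap.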
Reference - https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.CoreComponents.html#HowItWorks.CoreComponents.PrimaryKey
Primary Key: As in many databases, it is a unique key in a table, which means that no two records in a table can have the same primary key. The database, in this case Cassandra, is designed to make sure this condition holds in all situations. So if you write a record with primary key PK1 and a record with the same key PK1 already exists, it is overwritten; otherwise a new record is created.
Partition Key: It is a construct of distributed databases (where the data of a single table is divided into multiple parts called partitions). Partitions are distributed across nodes using a distribution strategy (usually a hash of the partition key) to allow horizontal scaling. The partition key is the set of columns of a record that decides which partition the record belongs to, and hence the physical location of the record in the distributed cluster of nodes.
Clustering Key: The clustering key decides the order of records within a partition. So if there are 10K records in a partition, the clustering key decides the sorted order in which those 10K records are physically stored.
Example:
Let's say you have a table in Cassandra to store the sales events of an e-commerce website.
[order_id, item_id, quantity, amount, payment_id, status, order_time, PRIMARY KEY( (order_id, item_id), order_time)] with clustering ORDER BY (order_time DESC);
So here,
Primary Key is ((order_id, item_id), order_time), and it decides the uniqueness of a record in the table.
Partition Key is (order_id, item_id); the hash of this tuple decides the partition of the record and its location in the distributed cluster.
Clustering Key is order_time; within a partition, records are ordered by order_time in descending order. So if you run a LIMIT 1 CQL query against a particular partition, you always get the record with the latest timestamp.
Composite key is just a term meaning that the primary key of a table is not a single column but multiple columns.
Primary key is a combination of partition and clustering key.
For example, we have table A and table B, which have a many-to-many relationship. An intersection table, table C, stores A.id and B.id along with a value that represents a relationship between the two. As a concrete example, imagine stackexchange, which has a user account, a forum, and a karma score. Or a student, a course, and a grade. If tables A and B are very large, table C can and probably will grow monstrously large very quickly (in fact, let's just assume it does). How do we go about dealing with such an issue? Is there a better way to design the tables to avoid this?
There is no magic. If some rows are connected and some aren't, this information has to be represented somehow, and the "relational" way of doing it is a "junction" (aka "link") table. Yes, a junction table can grow large, but fortunately databases are very capable of handling huge amounts of data.
There are good reasons for using junction table versus comma-separated list (or similar), including:
Efficient querying (through indexing and clustering).
Enforcement of referential integrity.
When designing a junction table, ask the following questions:
Do I need to query in only one direction or both? [1]
If one direction, just create a composite PRIMARY KEY on both foreign keys (let's call them PARENT_ID and CHILD_ID). Order matters: if you query from parent to children, PK should be: {PARENT_ID, CHILD_ID}.
If both directions, also create a composite index in the opposite order, which is {CHILD_ID, PARENT_ID} in this case.
Is the "extra" data small?
If yes, cluster the table and cover the extra data in the secondary index as necessary. [2]
If no, don't cluster the table and don't cover the extra data in the secondary index. [3]
Are there any additional tables for which the junction table acts as a parent?
If yes, consider whether adding a surrogate key might be worthwhile to keep child FKs slim. But beware that if you add a surrogate key, this will probably eliminate the opportunity for clustering.
In many cases, answers to these questions will be: both, yes and no, in which case your table will look similar to this (Oracle syntax below):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR2(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
) ORGANIZATION INDEX COMPRESS;
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID, EXTRA_DATA) COMPRESS;
Considerations:
ORGANIZATION INDEX: Oracle-specific syntax for what most DBMSes call clustering. Other DBMSes have their own syntax and some (MySQL/InnoDB) imply clustering and user cannot turn it off.
COMPRESS: Some DBMSes support leading-edge index compression. Since clustered table is essentially an index, compression can be applied to it as well.
JUNCTION_TABLE_IE1, EXTRA_DATA: Since extra data is covered by the secondary index, DBMS can get it without touching the table when querying in the direction from child to parents. Primary key acts as a clustering key so the extra data is naturally covered when querying from a parent to the children.
Physically, you have just two B-Trees (one is the clustered table and the other is the secondary index) and no table heap at all. This translates to good querying performance (both parent-to-child and child-to-parent directions can be satisfied by a simple index range scan) and fairly small overhead when inserting/deleting rows.
Here is the equivalent MS SQL Server syntax (sans index compression):
CREATE TABLE JUNCTION_TABLE (
PARENT_ID INT,
CHILD_ID INT,
EXTRA_DATA VARCHAR(50),
PRIMARY KEY (PARENT_ID, CHILD_ID),
FOREIGN KEY (PARENT_ID) REFERENCES PARENT_TABLE (PARENT_ID),
FOREIGN KEY (CHILD_ID) REFERENCES CHILD_TABLE (CHILD_ID)
);
CREATE UNIQUE INDEX JUNCTION_TABLE_IE1 ON
JUNCTION_TABLE (CHILD_ID, PARENT_ID) INCLUDE (EXTRA_DATA);
Note that MS SQL Server automatically clusters tables, unless PRIMARY KEY NONCLUSTERED is specified.
[1] In other words, do you only need to get the "children" of a given "parent", or might you also need to get the parents of a given child?
[2] Covering allows the query to be satisfied from the index alone, and avoids the expensive double-lookup that would otherwise be necessary when accessing data through a secondary index in a clustered table.
[3] This way, the extra data is not repeated (which would be expensive, since it's big), yet you avoid the double-lookup and replace it with (cheaper) table heap access. But beware of the clustering factor, which can destroy the performance of range scans in heap-based tables!
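For anyone who wants to experiment without an Oracle or SQL Server instance, here is a rough equivalent using Python's built-in sqlite3 module. SQLite is my substitution, not part of the answer above: WITHOUT ROWID plays the role of the clustered (index-organized) table, and the parent and child tables and their foreign keys are left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- WITHOUT ROWID stores the rows in the primary key B-tree itself.
    CREATE TABLE junction_table (
        parent_id  INTEGER NOT NULL,
        child_id   INTEGER NOT NULL,
        extra_data TEXT,
        PRIMARY KEY (parent_id, child_id)
    ) WITHOUT ROWID;
    -- Secondary index in the opposite order, covering the extra data.
    CREATE UNIQUE INDEX junction_table_ie1
        ON junction_table (child_id, parent_id, extra_data);
    INSERT INTO junction_table VALUES (1, 10, 'x'), (1, 11, 'y'), (2, 10, 'z');
""")

# Parent-to-children direction: served by the primary key.
children_of_1 = conn.execute(
    "SELECT child_id, extra_data FROM junction_table "
    "WHERE parent_id = ? ORDER BY child_id", (1,)
).fetchall()

# Child-to-parents direction: served by the secondary index.
parents_of_10 = conn.execute(
    "SELECT parent_id, extra_data FROM junction_table "
    "WHERE child_id = ? ORDER BY parent_id", (10,)
).fetchall()

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT parent_id, extra_data "
    "FROM junction_table WHERE child_id = ?", (10,)
).fetchall()
uses_secondary_index = any("junction_table_ie1" in row[-1] for row in plan)

print(children_of_1)  # [(10, 'x'), (11, 'y')]
print(parents_of_10)  # [(1, 'x'), (2, 'z')]
```

Printing plan shows which index the child-to-parents query uses; the exact wording of the plan text depends on the SQLite version.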