How to create primary keys in ClickHouse - database

I found a few examples in the documentation where primary keys are created by passing parameters to the ENGINE section.
But I did not find any description of the ENGINE arguments, what they mean, or how to create a primary key.
Thanks in advance. It would be great to add this info to the documentation if it's not present.

Primary keys are supported by the MergeTree family of storage engines.
https://clickhouse.tech/docs/en/engines/table_engines/mergetree_family/mergetree/
Note that for most serious tasks, you should use engines from the
MergeTree family.
The primary key is specified as one of the parameters to the storage engine.
The engine accepts parameters: the name of a Date type column containing the date, a sampling expression (optional), a tuple that defines the table's primary key, and the index granularity.
Example without sampling support:
MergeTree(EventDate, (CounterID, EventDate), 8192)
Example with sampling support:
MergeTree(EventDate, intHash32(UserID), (CounterID, EventDate, intHash32(UserID)), 8192)
So (CounterID, EventDate) or (CounterID, EventDate, intHash32(UserID)) is the primary key in these examples.
When using ReplicatedMergeTree, there are also two additional parameters, identifying the shard and the replica.
https://clickhouse.tech/docs/en/engines/table_engines/mergetree_family/replication/#creating-replicated-tables
The primary key is specified on table creation and cannot be changed later.
Despite the name, the primary key is not unique. It just defines the sort order of the data so that range queries can be processed efficiently. You can insert many rows with the same primary key value into a table.
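To make the "sort order, not uniqueness" point concrete, here is a toy model in Python. It is only a sketch of the idea, not ClickHouse's actual implementation: rows are kept sorted by the key tuple, and a sparse index stores the key of every GRANULARITY-th row. The names `range_scan` and `marks`, the sample rows, and the tiny granularity of 4 (the examples above use 8192) are assumptions made for illustration.

```python
import bisect

GRANULARITY = 4  # hypothetical small value; the examples above use 8192

# Rows are (CounterID, EventDate) pairs; duplicates are included on purpose.
rows = [
    (2, "2020-01-03"), (1, "2020-01-01"), (1, "2020-01-01"),
    (3, "2020-01-02"), (2, "2020-01-01"), (1, "2020-01-02"),
    (3, "2020-01-01"), (2, "2020-01-03"), (3, "2020-01-05"),
]

# The table keeps rows sorted by the key tuple; nothing enforces uniqueness.
table = sorted(rows)

# Sparse index: the key of the first row of every granule.
marks = [table[i] for i in range(0, len(table), GRANULARITY)]


def range_scan(lo, hi):
    """Return rows with lo <= key <= hi, reading only candidate granules."""
    # The granule just before the first mark >= lo may still hold matching rows.
    first = max(bisect.bisect_left(marks, lo) - 1, 0)
    # Granules whose first key is already > hi cannot hold matching rows.
    last = bisect.bisect_right(marks, hi)
    candidates = table[first * GRANULARITY:last * GRANULARITY]
    return [row for row in candidates if lo <= row <= hi]


# All rows for CounterID 2, whatever the date.
counter_2 = range_scan((2, ""), (2, "9999-99-99"))
print(counter_2)
```

In this model, a query that filters on a prefix of the key only has to touch the granules that can contain matching rows, and rows with identical keys simply sit next to each other.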

Related

Choosing a Column of the type TimeStamp to be a primary key in a SQL Server table

I have a dilemma choosing a primary key: either a column of the type TimeStamp, or a new surrogate key that would be an identity column.
I have created a small app that uses the .NET SqlBulkCopy class to copy data from the source table to the destination table.
The app does not perform well when the primary key is a column of the type TimeStamp, and does perform well when the primary key is an identity column.
This holds until the number of records I have to copy reaches some magic number. After that, both variants perform well.
What might be the reason for the slow performance of a table with TimeStamp as a primary key?
Any ideas?
I have tried to find additional resources about the type TIMESTAMP.
I have found only that the value should increase linearly and that the type occupies 8 bytes.
The type is deprecated (or not, depending on the documentation).
I would like a logical explanation of why the app does not perform well when the PK is of the type TIMESTAMP.

In RDBMS, is the key used for building the index for a database always the primary key?

How do we specify which key is used for building the index for a
database in SQL?
In most if not all RDBMS, is the search key used for building the index for a database always
the primary key?
From Database Management Systems, 3rd Edition, by Raghu
Ramakrishnan and Johannes Gehrke:
In principle, we can use any key, not just the primary
key, to refer to a tuple. However, using the primary key is
preferable because it is what the DBMS expects (this is the
significance of designating a particular candidate key as a
primary key) and optimizes for. For example, the DBMS may create
an index with the primary key fields as the search key, to make the
retrieval of a tuple given its primary key value efficient.
Thanks.
That depends on which RDBMS you are using. It will be something like
CREATE INDEX index_name ON table_name(key_name).
YES and NO.
a) When you create a table, the RDBMS will generally create an index for it using the primary key you specify in your CREATE TABLE statement. If you don't specify a primary key, some RDBMSs (MySQL's InnoDB, for example) will pick a unique, non-null key for you, or create an internal key (probably an integer type) as the primary key for this table.
b) Sometimes, depending on the query pattern, you may find that keys other than the primary key are used frequently (in WHERE clauses, for example); it is then worth building new indexes on these keys.
There are two aspects to your question:
I'm trying to interpret your questions. Perhaps what you need to understand is that there can be more than one index into a table?
Let's say you have a Customers table with 3 columns, CustomerID, LastName, and FirstName.
You create an index using a specific CREATE INDEX or ALTER TABLE command where you specify the specific columns you want to have included in the index. You do not have to have the primary key as part of an index; for example, you may create an index on a table of customers by their last name and first name to speed name searches while still having a different primary key like customerID. Here's some SQL-like syntax.
CREATE INDEX customer_name_idx ON Customers(LastName, FirstName)
This index doesn't include any primary keys nor does it require a primary key to function properly. Internally, it will likely point to some internal row IDs that only the DBMS cares about.
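As a small runnable sketch of the same idea, the snippet below uses Python's built-in sqlite3 module as a stand-in for whichever DBMS you actually use (an assumption; syntax and query plans differ between systems). The column types and sample rows are made up for illustration. It creates the Customers table with CustomerID as the primary key, adds the name index, and asks SQLite how it would run a name search.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers ("
    " CustomerID INTEGER PRIMARY KEY,"
    " LastName TEXT NOT NULL,"
    " FirstName TEXT NOT NULL)"
)
# Secondary index on the name columns; the primary key is not part of it.
conn.execute("CREATE INDEX customer_name_idx ON Customers(LastName, FirstName)")
conn.executemany(
    "INSERT INTO Customers (LastName, FirstName) VALUES (?, ?)",
    [("Smith", "Ann"), ("Smith", "Bob"), ("Jones", "Cat")],
)

# Ask SQLite for its query plan; the last column of each row is a text description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT CustomerID FROM Customers WHERE LastName = ? AND FirstName = ?",
    ("Smith", "Bob"),
).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)
print(plan_text)

# The query returns the same rows with or without the index; only the cost differs.
smith_ids = [
    row[0]
    for row in conn.execute(
        "SELECT CustomerID FROM Customers WHERE LastName = ? ORDER BY FirstName",
        ("Smith",),
    )
]
print(smith_ids)
```

The plan text should mention customer_name_idx, which shows the name search going through the secondary index rather than the primary key.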
I'm trying to understand what you mean here as well.
A DBMS can return a result regardless of the presence of an index; an index just makes it more efficient if your query matches up nicely with an index.
Designating a column as a primary key provides benefits such as enforced uniqueness, and possibly some performance benefits for enforcing other foreign key constraints.
As your quote says though, there is no written rule that says a primary key must also be an index. MySQL, and probably many other DBMSes, create an index automatically on the table's primary key, as it makes sense to do so from a technical level.
Anyway, I hope this makes sense and I hope I can clarify better if you have other questions.

What is the difference between Primary Key and unique key constraint?

What is the difference between Primary key And unique Key constraint?
What's the use of it??
Both are used to denote candidate keys for a table.
You can only have one primary key for a table so would just need to pick one if you have multiple candidates.
Either can be used in Foreign Key constraints. In SQL Server the Primary Key columns cannot be nullable. Columns used in Unique Key constraints can be.
By default in SQL Server the Primary Key will become the clustered index if it is created on a heap but it is by no means mandatory that the PK and clustered index should be the same.
A primary key is one which is used to identify the row in question. It might also have some meaning beyond that (if there was already a piece of "real" data that could serve) or it may be purely an implementation artefact (most IDENTITY columns, and equivalent auto-incremented values on other database systems).
A unique key is a more general case, where a key cannot have repeated values. In most cases people cannot have the same social security numbers in relation to the same jurisdiction (an international case could differ). Hence if we were storing social security numbers, then we would want to model them as unique, as any case of them matching an existing number is clearly wrong. Usernames generally must be unique also, so here's another case. External identifiers (identifiers used by another system, standard or protocol) tend to also be unique, e.g. there is only one language that has a given ISO 639 code, so if we were storing ISO 639 codes we would model that as unique.
This uniqueness can also be across more than one column. For example, in most hierarchical categorisation systems (e.g. a folder structure) no item can have both the same parent item and the same name, though there could be other items with the same parent and different names, and others with the same name and different parents. This multi-column capability is also present on primary keys.
A table may also have more than one unique key. E.g. a user may have both an id number and a username, and both will need to be unique.
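The id-plus-username case can be sketched as follows, using Python's sqlite3 module as a stand-in database (an assumption; the thread is about SQL Server, and details such as NULL handling in UNIQUE columns vary between systems). The Users table, its columns, and the helper `try_insert` are hypothetical names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Users ("
    " id INTEGER NOT NULL PRIMARY KEY,"  # the one primary key
    " username TEXT NOT NULL UNIQUE,"    # a second candidate key
    " ssn TEXT UNIQUE)"                  # a nullable unique column
)
conn.execute("INSERT INTO Users VALUES (1, 'alice', '111')")
conn.execute("INSERT INTO Users VALUES (2, 'bob', NULL)")


def try_insert(row):
    """Insert a row and report 'ok' or the constraint error message."""
    try:
        conn.execute("INSERT INTO Users VALUES (?, ?, ?)", row)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return str(exc)


dup_pk = try_insert((1, "carol", "222"))        # repeats id 1
dup_username = try_insert((3, "alice", "333"))  # repeats username 'alice'
dup_ssn = try_insert((4, "dave", "111"))        # repeats ssn '111'
row_count = conn.execute("SELECT COUNT(*) FROM Users").fetchone()[0]
print(dup_pk, dup_username, dup_ssn, row_count, sep="\n")
```

All three inserts are rejected, so the table still holds only the two original rows: the primary key and both unique constraints each enforce uniqueness independently.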
Any non-nullable unique key can therefore serve as a primary key. Sometimes primary keys that come from the innate data being modelled are referred to as "natural primary keys", because they are a "natural" part of the data, rather than just an implementation artefact. The decision as to which to use depends on a few things:
Likelihood of change of specification. If we modelled a social security number as unique and then had to adapt to allow for multiple jurisdictions where two or more use a similar enough numbering system to allow for collisions, we likely need just remove the uniqueness constraint (other changes may be needed). If it was our primary key, we now also need to use a new primary key, and change any table that was using that primary key as part of a relationship, and any query that joined on it.
Speed of look-up. Key efficiency can be important, as they are used in many WHERE clauses and (more often) in many JOINs. With JOINS in particular, speed of lookup can be vital. The impact will depend on implementation details, and different databases vary according to how they will handle different datatypes (I would have few qualms from a performance perspective in using a large piece of text as a primary key in Postgres where I could specify the use of hash joins, but I'd be very hesitant to do so in SQLServer [Edit: for "large" I'm thinking of perhaps the size of a username, not something the size of the entire Norse Eddas!]).
Frequency of the key being the only interesting data. For example, with a table of languages, and a table of pieces of comments in that language, very often the only reason I would want to join on the language table when dealing with the comments table is either to obtain the language code or to restrict a query to those with a particular language code. Other information about the language is likely to be much more rarely used. In this case while joining on the code is likely to be less efficient than joining on a numeric id set from an IDENTITY column, having the code as the primary key - and hence as what is stored in the foreign key column on the comments table - will remove the need for any JOIN at all, with a considerable efficiency gain. More often though I want more information from the relevant tables than that, so making the JOIN more efficient is more important.
Primary key:
A primary key uniquely identifies each row in a table.
A primary key allows neither duplicate values nor NULL.
By default (in SQL Server), a primary key is a clustered index.
A table can have only one primary key.
Unique key:
A unique key also uniquely identifies each row in a table.
A unique key does not allow duplicate values, but (in SQL Server) it allows at most one NULL.
By default, a unique key is a non-clustered index.
This is a fruitful link for understanding primary keys: Database Keys.
Keep in mind that a table can have only one clustered index [talking about SQL Server 2005].
If we want another column to be unique, we use a unique key, because
a table can have more than one unique key.
A primary key is just any one candidate key. In principle primary keys are not different from any other candidate key because all keys are equal in the relational model.
SQL, however, has two different syntaxes for implementing candidate keys: the PRIMARY KEY constraint and the UNIQUE constraint (on non-nullable columns of course). In practice they achieve exactly the same thing except for the essentially useless restriction that a PRIMARY KEY can only be used once per table whereas a UNIQUE constraint can be used multiple times.
So there is no fundamental "use" for the PRIMARY KEY constraint. It is redundant and could easily be ignored or dropped from the language altogether. However, many people find it convenient to single out one particular key per table as having special significance. There is a very widely observed convention that keys designated with PRIMARY KEY are used for foreign key references, although this is entirely optional.
Short version:
From the point of view of database theory, there is none. Both are simply candidate keys.
In practice, most DBMSs like to have one "standard key", which can be used for e.g. deciding how to store data, and to tell tools and DB clients which is the best way to identify a record.
So distinguishing one unique key as the "primary key" is just an implementation convenience (but an important one).

When having an identity column is not a good idea?

In tables where you need only one column as the key, and values in that column can be integers, when shouldn't you use an identity field?
Conversely, for the same table and column, when would you generate its values manually rather than use an autogenerated value for each record?
I guess that it would be the case when there are lots of inserts and deletes to the table. Am I right? What other situations are there?
If you have already settled on the surrogate side of the Great Primary Key Debacle, then I can't find a single reason not to use identity keys. The usual alternatives are GUIDs (they have many disadvantages, primarily from size and randomness) and application-layer generated keys. But creating a surrogate key in the application layer is a little bit harder than it seems, and it also does not cover non-application data access (i.e. batch loads, imports, other apps, etc.). The one special case is distributed applications, where GUIDs and even sequential GUIDs may offer a better alternative to site id + identity keys.
I suppose if you are creating a many-to-many linking table, where both fields are foreign keys, you don't need an identity field.
Nowadays I imagine that most ORMs expect there to be an identity field in every table. In general, it is a good practice to provide one.
I'm not sure I understand enough about your context, but I interpret your question to be:
"If I need the database to create a unique column (for whatever reason), when shouldn't it be a monotonically increasing integer (identity) column?"
In those cases, there's no reason to use anything other than the facility provided by the DBMS for the purpose; in your case (SQL Server?) that's an identity.
Except:
If you'll ever need to merge the table with data from another source, use a GUID, which will prevent duplicate keys from colliding.
If you need to merge databases it's a lot easier if you don't have to regenerate keys.
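A rough sketch of the merge problem, using Python's sqlite3 and uuid modules (an assumption; SQLite's AUTOINCREMENT stands in for an identity column, and `uuid.uuid4()` for a database-generated GUID). The helpers `make_source` and `count_key_collisions` and the sample names are hypothetical.

```python
import sqlite3
import uuid


def make_source(names, use_guid):
    """Build an independent in-memory database and return its (id, name) rows."""
    conn = sqlite3.connect(":memory:")
    if use_guid:
        conn.execute("CREATE TABLE Customers (id TEXT PRIMARY KEY, name TEXT)")
        rows = [(str(uuid.uuid4()), name) for name in names]
        conn.executemany("INSERT INTO Customers VALUES (?, ?)", rows)
    else:
        # AUTOINCREMENT plays the role of an identity column here.
        conn.execute(
            "CREATE TABLE Customers ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)"
        )
        conn.executemany(
            "INSERT INTO Customers (name) VALUES (?)", [(name,) for name in names]
        )
    result = conn.execute("SELECT id, name FROM Customers").fetchall()
    conn.close()
    return result


def count_key_collisions(rows_a, rows_b):
    """Number of key values that appear in both sources."""
    return len({key for key, _ in rows_a} & {key for key, _ in rows_b})


identity_collisions = count_key_collisions(
    make_source(["Ann", "Bob"], use_guid=False),
    make_source(["Cat", "Dan"], use_guid=False),
)
guid_collisions = count_key_collisions(
    make_source(["Ann", "Bob"], use_guid=True),
    make_source(["Cat", "Dan"], use_guid=True),
)
print(identity_collisions, guid_collisions)
```

Both identity-keyed sources start counting from 1, so every key collides on merge and would have to be regenerated; the GUID-keyed sources can be combined as they are.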
One case of not wanting an identity field would be in a one to one relationship. The secondary table would have as its primary key the same value as the primary table. The only reason to have an identity field in that situation would seem to be to satisfy an ORM.
You cannot (normally) specify values when inserting into identity columns, so for example if the column "id" was specified as an identity, the following SQL would fail:
INSERT INTO MyTable (id, name) VALUES (1, 'Smith')
In order to perform this sort of insert you need to set IDENTITY_INSERT ON for that table. This is not intended to be on normally, and it can be on for only one table per session at any point in time.
If I need a surrogate, I would either use an IDENTITY column or a GUID column depending on the need for global uniqueness.
If there is a natural primary key, or the primary key is defined as a unique combination of other foreign keys, then I typically do not have an IDENTITY, nor do I use it as the primary key.
There is an exception, which is snapshot configuration tables which I am tracking with an audit trigger. In this case, there is usually a logical "primary key" (usually date of the snapshot and natural key of the row - like a cost center or gl account number for which the row is a configuration record), but instead of using the natural "primary key" as the primary key, I add an IDENTITY and make that the primary key and make a unique index or constraint on the date and natural key. Although theoretically the date and natural key shouldn't change, in these tables, if a user does that instead of adding a new row and deleting the old row, I want the audit (which reflects a change to a row identified by its primary key) to really reflect a change in the row - not the disappearance of a key and the appearance of a new one.
I recently implemented a Suffix Trie in C# that could index novels, and then allow searches to be done extremely fast, linear to the size of the search string. Part of the requirements (this was a homework assignment) was to use offline storage, so I used MS SQL, and needed a structure to represent a Node in a table.
I ended up with the following structure : NodeID Character ParentID, etc, where the NodeID was a primary key.
I didn't want this to be done as an autoincrementing identity for two main reasons.
How do I get the value of a NodeID after I add it to the database/data table?
I wanted more control when it came to generating my own IDs.

Are there any good reasons to have a database table without an integer primary key?

Although I'm guilty of this crime, it seems to me there can't be any good reason for a table to not have an identity field primary key.
Pros:
- whether you want to or not, you can now uniquely identify every row in your table which previously you could not do
- you can't do sql replication without a primary key on your table
Cons:
- an extra 32 bits for each row of your table
Consider for example the case where you need to store user settings in a table in your database. You have a column for the setting name and a column for the setting value. No primary key is necessary, but having an integer identity column and using it as your primary key seems like a best practice for any table you ever create.
Are there other reasons besides size that every table shouldn't just have an integer identity field?
Sure, an example in a single-database solution is if you have a table of countries, it probably makes more sense to use the ISO 3166-1-alpha-2 country code as the primary key as this is an international standard, and makes queries much more readable (e.g. CountryCode = 'GB' as opposed to CountryCode = 28). A similar argument could be applied to ISO 4217 currency codes.
In a SQL Server database solution using replication, a UNIQUEIDENTIFIER key would make more sense as GUIDs are required for some types of replication (and also make it much easier to avoid key conflicts if there are multiple source databases!).
The most clear example of a table that doesn't need a surrogate key is a many-to-many relation:
CREATE TABLE Authorship (
author_id INT NOT NULL,
book_id INT NOT NULL,
PRIMARY KEY (author_id, book_id),
FOREIGN KEY (author_id) REFERENCES Authors (author_id),
FOREIGN KEY (book_id) REFERENCES Books (book_id)
);
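To see the composite key doing its job, here is a minimal sketch that runs the same Authorship definition in SQLite through Python's sqlite3 module (an assumption; the Authors and Books tables and the sample rows are invented, and SQLite needs foreign keys switched on explicitly).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE Authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Books (book_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE Authorship (
  author_id INT NOT NULL,
  book_id INT NOT NULL,
  PRIMARY KEY (author_id, book_id),
  FOREIGN KEY (author_id) REFERENCES Authors (author_id),
  FOREIGN KEY (book_id) REFERENCES Books (book_id)
);
INSERT INTO Authors VALUES (1, 'A. Author'), (2, 'B. Writer');
INSERT INTO Books VALUES (10, 'First Book'), (11, 'Second Book');
INSERT INTO Authorship VALUES (1, 10), (2, 10), (1, 11);
""")


def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# The same author/book pair twice violates the composite primary key.
duplicate_rejected = rejected("INSERT INTO Authorship VALUES (1, 10)")
# An author that does not exist violates the foreign key.
orphan_rejected = rejected("INSERT INTO Authorship VALUES (99, 10)")
pair_count = conn.execute("SELECT COUNT(*) FROM Authorship").fetchone()[0]
print(duplicate_rejected, orphan_rejected, pair_count)
```

The pair (author_id, book_id) already identifies each row, so an extra surrogate column would add storage without adding any guarantee.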
I also prefer a natural key when I design a tagging system:
CREATE TABLE Tags (
tag VARCHAR(20) PRIMARY KEY
);
CREATE TABLE ArticlesTagged (
article_id INT NOT NULL,
tag VARCHAR(20) NOT NULL,
PRIMARY KEY (article_id, tag),
FOREIGN KEY (article_id) REFERENCES Articles (article_id),
FOREIGN KEY (tag) REFERENCES Tags (tag)
);
This has some advantages over using a surrogate "tag_id" key:
You can ensure tags are unique, without adding a superfluous UNIQUE constraint.
You prevent two distinct tags from having the exact same spelling.
Dependent tables that reference the tag already have the tag text; they don't need to join to Tags to get the text.
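The last point, reading tag text without a join, can be sketched like this with Python's sqlite3 module (an assumption; the Articles table and the sample data are made up for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE Articles (article_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE Tags (tag VARCHAR(20) PRIMARY KEY);
CREATE TABLE ArticlesTagged (
  article_id INT NOT NULL,
  tag VARCHAR(20) NOT NULL,
  PRIMARY KEY (article_id, tag),
  FOREIGN KEY (article_id) REFERENCES Articles (article_id),
  FOREIGN KEY (tag) REFERENCES Tags (tag)
);
INSERT INTO Articles VALUES (1, 'Keys in practice');
INSERT INTO Tags VALUES ('sql'), ('indexing');
INSERT INTO ArticlesTagged VALUES (1, 'sql'), (1, 'indexing');
""")

# The tag text lives in the linking table itself, so no JOIN to Tags is needed.
tags_for_article = [
    row[0]
    for row in conn.execute(
        "SELECT tag FROM ArticlesTagged WHERE article_id = ? ORDER BY tag", (1,)
    )
]

# The primary key on Tags already rejects a second 'sql' tag.
try:
    conn.execute("INSERT INTO Tags VALUES ('sql')")
    duplicate_tag_rejected = False
except sqlite3.IntegrityError:
    duplicate_tag_rejected = True

print(tags_for_article, duplicate_tag_rejected)
```

With a surrogate tag_id, the first query would need a join to Tags, and uniqueness of the tag text would need its own UNIQUE constraint.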
Every table should have a primary key. It doesn't matter if it's an integer, GUID, or the "setting name" column. The type depends on the requirements of the application. Ideally, if you are going to join the table to another, it would be best to use a GUID or integer as your primary key.
Yes, there are good reasons. You can have semantically meaningful true keys, rather than artificial identity keys. Also, it is not a good idea to have a separate autoincrementing primary key for a many-to-many table. There are some reasons you might want to choose a GUID.
That being said, I typically use autoincrementing 64bit integers for primary keys.
Every table should have a primary key. But it doesn't need to be a single field identifier. Take for example in a finance system, you may have the primary key on a journal table being the Journal ID and Line No. This will produce a unique combination for each row (and the Journal ID will be a primary key in its own table)
Your primary key needs to be defined on how you are going to link the table to other tables.
I don't think every table needs a primary key. Sometimes you only want to "connect" the contents of two tables - via their primary key.
So you have a table like users and one table like groups (each with primary keys) and you have a third table called users_groups with only two columns (user and group) where users and groups are connected with each other.
For example a row with user = 3 and group = 6 would link the user with primary key 3 to the group with primary key 6.
One reason not to have primary key defined as identity is having primary key defined as GUIDs or populated with externally generated values.
In general, every table that is semantically meaningful by itself should have primary key and such key should have no semantic meaning. A join table that realizes many-to-many relationship is not meaningful by itself and so it doesn't need such primary key (it already has one via its values).
To be a properly normalised table, each row should only have a single identifiable key. Many tables will already have natural keys, such as a unique invoice number. I agree that, especially with storage being so cheap, there is little overhead in having an autonumber/identity key on all tables, but in that case, which is the real key?
Another area where I personally don't use this approach is reference data, where typically we have a Code and a Description:
Code, Description
'L', 'Live'
'O', 'Old'
'P', 'Pending'
In this situation making code a primary key ensures no duplicates, and is more human readable.
The key difference (sorry) between a natural primary key and a surrogate primary key is that the value of the natural key contains information whereas the value of a surrogate key doesn't.
Why is this important? Well a natural primary key is by definition guaranteed to be unique, but its value is not usually guaranteed to stay the same. When it changes, you have to update it in multiple places.
A surrogate key's value has no real meaning and simply serves to identify that row, so it never needs to be changed. It is a feature of the model rather than the domain itself.
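The difference shows up when a code changes. The sketch below uses Python's sqlite3 module (an assumption; the department and employee tables, their columns, and the 'OPS' to 'OPR' rename are invented for illustration). With a natural key, the new value has to be propagated to every referencing row, here via ON UPDATE CASCADE; with a surrogate key, the referencing rows are not touched at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for the cascade to fire in SQLite
conn.executescript("""
-- Natural key: the department code itself is the primary key.
CREATE TABLE DeptNatural (code TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE EmpNatural (
  emp TEXT PRIMARY KEY,
  dept_code TEXT NOT NULL REFERENCES DeptNatural (code) ON UPDATE CASCADE
);
-- Surrogate key: the code is ordinary data guarded by a UNIQUE constraint.
CREATE TABLE DeptSurrogate (
  id INTEGER PRIMARY KEY,
  code TEXT NOT NULL UNIQUE,
  name TEXT NOT NULL
);
CREATE TABLE EmpSurrogate (
  emp TEXT PRIMARY KEY,
  dept_id INTEGER NOT NULL REFERENCES DeptSurrogate (id)
);
INSERT INTO DeptNatural VALUES ('OPS', 'Operations');
INSERT INTO EmpNatural VALUES ('ann', 'OPS'), ('bob', 'OPS');
INSERT INTO DeptSurrogate VALUES (1, 'OPS', 'Operations');
INSERT INTO EmpSurrogate VALUES ('ann', 1), ('bob', 1);
""")

# Rename the department code in both designs.
conn.execute("UPDATE DeptNatural SET code = 'OPR' WHERE code = 'OPS'")
conn.execute("UPDATE DeptSurrogate SET code = 'OPR' WHERE code = 'OPS'")

# Natural key: every referencing row had to be rewritten by the cascade.
natural_refs = [
    row[0] for row in conn.execute("SELECT dept_code FROM EmpNatural ORDER BY emp")
]
# Surrogate key: the referencing rows still hold the same id.
surrogate_refs = [
    row[0] for row in conn.execute("SELECT dept_id FROM EmpSurrogate ORDER BY emp")
]
print(natural_refs, surrogate_refs)
```

Without the cascade clause (or on a system that lacks it), the natural-key update would have to be done by hand in every referencing table.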
So the only place I would say a surrogate key isn't appropriate is in an association table which only contains columns referring to rows in other tables (most many-to-many relations). The only information this table carries is the association between two (or more) rows, and it already consists solely of surrogate key values. In this case I would choose a composite primary key.
If such a table had bag semantics, or carried additional information about the association, I would add a surrogate key.
A primary key is ALWAYS a good idea. It allows for very fast and easy joining of tables. It aids external tools that can read system tables to make joins, allowing less skilled people to create their own queries by drag-and-drop. It also makes the implementation of referential integrity a breeze, and that is a good idea from the get-go.
I know for sure that some very smart people working for web giants do this. While I don't know their reasons, I know 2 cases where PK-less tables make sense:
Importing data. The table is temporary. Insertions and whole table scans need to be as fast as possible. Also, we need to accept duplicate records. Later we will clean the data, but the import process needs to work.
Analytics in a DBMS. Identifying a row is not useful - if we need to do it, it is not analytics. We just need a non-relational, redundant, horrible blob that looks like a table. We will build summary tables or materialized views by writing proper SQL queries.
Note that these cases have good reasons to be non-relational. But normally your tables should be relational, so... yes, they need a primary key.

Resources